FOREWORD


This issue of the Review, the fifth devoted to methods of research and
appraisal in education, follows the pattern of summary and review used 
in the third and fourth cycles published in December 1942 and 1945, 
respectively. Limitations on space allowed for various chapters have re- 
quired selective rather than complete discussion of all research studies 
published during the three-year period. Contributors to this issue have 
charted the significant trends, and have frequently indicated their critical 
estimates of studies. Their generous contribution is acknowledged. 

Trends of special importance in methods and technics of educational 
research which have become more apparent in the three-year cycle covered 
by this issue of the Review are presented in summary form in appropriate 
chapters. Bibliographic and documentary research has been characterized 
by enlarged and improved bibliographic guides and critical evaluation by 
experts of the canons of historiography. In diagnostic and synthetic studies 
of individuals, the trend has been toward the use of projective technics, 
including sociodrama and the nondirective interview, for research in per- 
sonality. In trend, survey, and evaluation studies increasing activity has 
occurred in the construction and application of informal technics to 
appraise the less tangible objectives of education and in follow-up studies. 
Research methods have shown an increasing use of appropriate logic of 
induction for the sampling used and more efficient experimental designs 
based upon analysis of variance methods. Observational methods of re- 
search have revealed improved technics in analysis of documents, rating 
methods, opinion surveys, and observational instruments and aids. In tests 
and measurements, advances were made in technical construction of tests, 
more adequate measurement of long-established objectives, and the dis- 
tinction between measurement of detailed subjectmatter content versus 
evaluation of general educational outcomes. Recent developments in sta- 
tistical theory emphasize statistical technics of factorial design rather than 
those applied to studies using the so-called “law of the single variable.” 
Articles on computational technics have increased in number rapidly during 
the three-year period covered by this review. 

One member of the committee indicated the following problems which 
should receive increased attention and effort from research workers in this 
field. Briefly, these problems are: (a) an analysis, description, and history 
of trend, survey, and evaluation studies in recent years; (b) a definition 
of appropriate methodology for evaluation; (c) an analysis of studies in 
education which deal with what is desirable or what should be done in 
educational research; and (d) a comprehensive and critical review of 
educational tests—perhaps an entire issue of the Review devoted to this 
topic. These suggestions indicate some of the unfinished work that offers 
a challenge to research workers. 


J. Wayne Wrightstone, Chairman
Committee on Methods of Research and Appraisal in Education 

CHAPTER I


Library Resources and Documentary Research 


CARTER V. GOOD 


The description of library resources, bibliographical technics, and
documentary research presented by Scates (71) in the December 1945 number
of the Review and by Good (37) in the December 1942 issue is carried 
forward in this chapter. The topics treated include: (a) library functions, 
organization, cooperation, and resources; (b) general treatises or manuals 
on library aids and technics; (c) guides to periodicals and books; (d) 
encyclopedias and dictionaries; (e) guides to theses and selected research 
projects; (f) serial and occasional bibliographies and summaries; (g) 
institutional and biographical directories or handbooks; and (h) histori- 
ography in history proper, education, psychology, sociology, philosophy, 
science, and related fields. 


Library Functions, Organization, Cooperation, and Resources 


Butler and others (17) analyzed the reference functions of the public, 
school, college and university, and research libraries, with special attention 
to problems in art and music, map collections, social sciences, science and 
technology, the humanities, administration, and in personnel and training. 
Wilson and Tauber (82) described established practice in library organi- 
zation and administration, with a few critical comments on controversial 
issues or practices. Davidson and Kuhlman (25) outlined briefly the grow- 
ing movement toward development of library and research resources in 
certain cooperative university centers of the South. Evans (31) described 
the national bibliographical services of the Library of Congress thru its 
facilities, collections, and experience, which have developed over a period 
of a century and a half. 

Downs (27) emphasized the fact that international understanding is 
based on free interchange of the cultural records of nations as collected 
and preserved by libraries, educational institutions, and cultural organi- 
zations in general. International exchange of resources involves a number 
of possibilities and problems such as (a) exchanges between institutions; 
(b) exchange of government publications; (c) national bibliography; 
(d) reproduction of research materials; (e) coordination of book acqui- 
sitions; (f) reconstruction of foreign libraries; (g) copyright, tariff, and 
postal regulations; (h) translations; (i) exhibits; and (j) interchange of 
personnel. 

Verdoorn (79) defined the chief aims of international scientific co- 
operation as exchange of information, attainment of objectives which 
scientists of a single institution or nation cannot accomplish alone, and
development of an esprit de corps which may counteract the evils of human
international politics. He identified the means by which these aims may be
approached as: (a) various forms of cooperative research, (b) inter- 
national conferences and congresses, (c) activities of international com- 
missions and committees responsible for the solution of specific problems, 
(d) personal contacts, visits, and correspondence, (e) exchanges of re- 
search materials, specimens, and literature, (f) exchanges of professors, 
research workers, and students, and (g) various publications (for example, 
scientific journals, abstracting journals, textbooks and reference books, 
popular books and journals, directories, and bibliographies and indexes).


General Treatises or Manuals on Library Aids and Technics 


Winchell (83, 84) prepared the third and fourth supplements to the 
sixth edition of Mudge’s Guide to Reference Books (62), covering periodi- 
cals, newspapers, societies, essays and general literature, dissertations and 
research projects, encyclopedias, dictionaries, philosophy and psychology, 
religion, social sciences, science, useful arts, fine arts, literature, biography, 
geography, history, government documents, and bibliography. Barton (7) 
prepared a brief guide to reference books, and Bryan (16) an even smaller 
key to professional information, intended primarily for teachers. The third 
edition of Alexander’s standard text and reference book on educational 
information and data is scheduled to appear in the near future. 


Guides to Periodicals and Books 


Beginning with the educational books of 1947, an annual compilation 
of outstanding publications appeared in the Phi Delta Kappan (20), after 
having been published for many years in School and Society; this annual 
list of educational books identifies sixty volumes adjudged especially 
significant. Witmer (86) compiled a list of important books in education 
for 1945 and 1946, and Bay (9) characterized briefly important books in 
science for the past century. 

The Bibliographic Index (35) cumulated for 1943-1946, and the fifth 
edition of Ulrich’s Periodicals Directory (78) appeared. Hirshberg and 
Melinat (49) selected those books and pamphlets, published for the most 
part during the past twenty years, believed to be most generally useful in 
libraries, with emphasis on serials, directories, bibliographies, and hand- 
books which have greatest potential reference use. This compilation is a 
simplified, practical approach to the complexity of government documents
and provides a key to the wealth of information found in federal govern- 
ment publications, especially for the selective depository libraries and the 
smaller public and college libraries. 


Encyclopedias and Dictionaries 


The forthcoming revised edition of the Encyclopedia of Educational 
Research (61) will be considerably revised and enlarged. Kaplan (52) 
edited two volumes on guidance, not encyclopedic in the usual sense, but
a compilation of discussions on major topics, interspersed with short 
articles, with emphasis on major occupations and the rise of aptitude and 
specific vocational tests. 

Carmichael (19) edited an advanced textbook or reference work on 
some of the most important aspects of research on human development 
and child psychology. The bibliographies are extensive, with emphasis on 
the period since 1933. This manual is a sequel to the Handbook of Child 
Psychology (64) edited by Murchison. 

Harriman (45) edited a volume of 100 signed articles, alphabetically 
arranged, on a miscellaneous variety of topics in the field of psychology. 
His title of “Encyclopedia” for this volume is misleading. Because of 
incomplete coverage of the subject, disproportionate treatment of subjects 
in comparison with their importance, and inadequate indexing, this volume 
will prove less satisfactory for reference purposes than the psychological 
handbooks and dictionaries. 

Thornton (74) presented a proposal for an encyclopedia of psychological 
information, to embody the following features:


(a) a complete index in one volume of all psychological topics; (b) 
reviews of all important areas in psychology, each review written by an 
authority in that area; (c) complete bibliographies on all important topics 
in psychology, with the most significant references starred; (d) cross- 
references between related areas; (e) provision for correction of errors 
by the readers as well as the editors; (f) development of a normative dic- 
tionary of terms to simplify the task of communication of ideas thru 
the encyclopedia; (g) frequent revision and supplementation of reviews, 
bibliographies, cross-references, and index in order to incorporate the 
results of new researches and to eliminate errors that have been discovered. 


Runes (68) edited a relatively brief volume, attempting to provide defi- 
nitions thruout the range of philosophic thought. Unusual in the annals 
of professional writing, ten contributors disapproved the work as
published, altho a number of good individual articles and definitions were
included. New dictionaries of psychology (46) and of social welfare (88)
appeared.


Guides to Theses and Selected Research Projects 


The annual compilation of Doctoral Dissertations Accepted by American 
Universities (77) continued under the editorship of Trotier. The list of 
doctors’ dissertations under way in education appeared in the Phi Delta 
Kappan (38, 39), after having been published for many years in the 
Journal of Educational Research. Other compilations of graduate disserta- 
tions covered: sociology (3, 4), history (2), political science (32), library 
science (22), and Negro education (55). Bowers (12) continued his 
annual census of research projects in sociology. 

Serial and Occasional Bibliographies and Summaries


The annual bibliography on the methodology of research appeared in 
the Phi Delta Kappan (41, 42), after publication over a period of years 
in the Journal of Educational Research. Other serial bibliographies or 
summaries covered were: reading (44, 76), teacher supply and demand 
(30), Negro education (56), and history and philosophy of science (70). 
Selected occasional bibliographies or summaries were published in the 
areas of guidance (48), mental measurements (47), character education 
(54), juvenile delinquency (18), and radio research (58). 


Institutional and Biographical Directories or Handbooks 


Standard guides to higher institutions included the fifth edition of 
American Universities and Colleges (15) and the second edition of Ameri- 
can Junior Colleges (11), as well as the College Blue Book (51). A Guide 
to Colleges, Universities, and Professional Schools in the United States 
(40) was prepared especially for returning veterans interested in higher 
education; it attempted a complete coverage of collegiate and professional 
institutions and was arranged in tabular form for ease of reference. 

Bates (8) described the origin, functions, and national contributions of 
scientific societies, based on publications of the societies, and on other 
sources such as biographies and memoirs. Visher (80) analyzed the char- 
acteristics of scientists starred in American Men of Science, and Hughes 
(50) made a study of graduate schools conferring the doctorate during 
recent years. 


Historiography and Historical Writing 


The years since World War II have been fruitful in terms of producing 
a variety of historical writing in education, psychology, and the social 
sciences. Certain of the references in this section deal with the method- 
ology of historical research, while others are concerned primarily with the 
historical backgrounds of such areas as education, psychology, sociology, 
and philosophy. 

Curti and his committee (73) arrived at a number of propositions re- 
lating to basic premises of inquiry, methodological precautions, desirable 
technics and principles, and relations with neighboring disciplines. Woody 
(87) outlined the conceptions and philosophies of history, as well as 
methods and technics of historical research, with illustrations and appli- 
cations of special interest to workers in education. 

Cone (24) emphasized that certain larger factors affect especially the 
disposition of written history, and also may influence rhetorical artifice. 
“These three major factors are purpose, content, and scope, all of which 
directly affect and sometimes determine the selection of facts, the organi- 
zation, arrangement, shaping, proportioning, composing, and adapting the 
materials (disposition), and the manner of writing (style).”

Cohen (21), the logician, criticized sharply all monistic theories of
historical change, but repudiated the skeptical view that, because no his- 
torical account can be complete, scientific history is impossible. 

Wittels (85) maintained that the economic explanation of history leaves 
a gap which psychology has to fill: 

An explosive part is played in historical events by unconscious defense 
mechanisms against bisexuality, father or mother fixation, sadism, masoch- 
ism, exhibitionism, and other instincts. The content of radicalism may 
suddenly swing to the opposite extreme: leftists might change to radical 
conservatives and vice versa, because of a blind inner urge. Revolutions, 
the origin of religions, cannot be explained by economic (materialistic)
reasoning alone. Not only the “how” of historical developments is created
by exceptional men but also the “what.” 


Garraghan’s (34) metaphysical philosophy of historical interpretation 
covered the meaning of history, nature and classification of sources, criti- 
cism and appraisal of sources, and synthesis and exposition in presentation 
of the narrative (36). Collingwood (23) traced the philosophy or idea of 
history from Herodotus to the present. Einstein (29) analyzed the meaning 
of change as it affects history, especially as illustrated in the practices of 
dictators; he spoke of power as the instrument of change and of history 
as its record. 

Other recent titles of books included in this area were: The Study and 
Teaching of American History (75), Methodology of the Social Sciences 
(53), The Use of History (66), Work and History (72), The Greater 
Roman Historians (57), Protohistory (89), In Defense of Materialism 
(65), and This Thing Called History (81). 

Brennan (13) wrote from the standpoint of a Thomist, seeking to 
achieve in psychology a reconciliation between science and metaphysics. 
Dennis edited a book, Readings in the History of Psychology (26). 

Barnes and others (6) discussed social origins, ways of group life, and 
other aspects of sociology from Comte to Sorokin, covering the important 
countries of the world from the ancient Near East to the present. The 
second edition of The Development of Social Thought (10) appeared. 

Rugg (67), recognizing that the present disagreement among scholars 
stems from the conflict of three centuries between the philosophies of 
authority and of experience, took his stand with Peirce, James, Dewey, 
Veblen, Whitman, and O. W. Holmes, and analyzed the foundations of 
education in psychology, sociology, esthetics, and ethics. His historical 
treatment covered many aspects of cultural life in this country from 1890 
to the present. Broad historical treatments of the backgrounds of education 
were: Brubacher’s A History of the Problems of Education (14), Edwards 
and Richey’s The School in the American Social Order (28), Good’s A 
History of Western Education (43), Melvin’s Education: A History (59), 
and Mulhern’s A History of Education (63). 

Russell (69) sought “to exhibit philosophy as an integral part of social 
and political life: not as the isolated speculations of remarkable individuals 
but as both an effect and a cause of the character of the various
communities in which different systems flourished. This purpose demands more
of an account of general history than is usually given by historians of 
philosophy.” Other historical treatments of philosophy were by Miller 
(60), Fuller (33), and Gilson (36). 

CHAPTER II


Studies of Individuals 


RUTH M. STRANG and DEBORA PANSEGROUW 


If it is true that “wars begin in the minds of men,” it is extremely
important that we understand the dynamics of personality: why people behave
as they do. Personality is still a vast domain in which much research is 
needed. During the three-year period covered by this issue, however, a 
number of significant books on personality have been written. Murphy’s 
Personality, a Biosocial Approach to Origins and Structure (61) explores 
personality thru present research, not only in psychology but also in 
sociology, anthropology, and biology, and points the way to more signifi- 
cant research. Young likewise has a broad base for his study of Personality 
and Problems of Adjustment (98). Stagner has completely rewritten his 
Psychology of Personality (84), covering recent work in projective testing 
and supporting a point of view that emphasizes perception and the interior 
organization of experience. An important new anthology, Personality in 
Nature, Society and Culture (53), consisting of papers by thirty-seven 
authorities in the field, presents an orderly statement of our present knowl- 
edge of personality formation. The strictly experimental quantitative 
approach was taken by Eysenck in his Dimensions of Personality (32). 
Largely from researches carried on with 10,000 normal and neurotic sub- 
jects at a wartime mental hospital, Eysenck and his associates succeeded 
in factoring out and describing two main dimensions of personality, named,
with reservations, “neuroticism” and “extraversion-introversion.” Similar 
in its quantitative experimental emphasis is Cattell’s Description and Meas- 
urement of Personality (18), in which he discussed the findings of pub- 
lished and unpublished studies and showed how clinical observation, rating 
of behavior, self-inventories, and objective tests have contributed to our 
understanding of the “factors, syndromes, and traits” of personality. Its 
emphasis is on methods of research. Entirely concerned with methodology 
is Rapaport, Gill, and Schafer’s second volume on Diagnostic Psychological 
Testing (70). 

Much of the research that would naturally be included in this chapter 
has been recently reviewed by Anderson and Embree in the Review for 
April 1948 (Vol. XVIII, No. 2) and in several chapters of the Review for 
February 1947 (Vol. XVII, No. 1) on Psychological Tests and Their Uses. 
The chapter by Havighurst, Kuhlen, and McGuire on “Personality Develop- 
ment” in the Review for December 1947 (Vol. XVII, No. 5) likewise 
contains much of significance on studies of individuals. 


Approaches to the Study of Individuals 


Three important trends in approaches to the study of personality are: 
(a) the broadening of the developmental approach, (b) the focus on the 
dynamics of personality, and (c) the approach thru the study of
interpersonal relations. Developmental studies of individuals have used a
combination of technics and statistical methods. More precise instruments to
use in quantitative experimental developmental studies are described in 
the second revised edition of Gesell and Amatruda’s Developmental 
Diagnosis (38). 

The increased use of projective technics, personal documents, and ver- 
batim reports of nondirective interviews in research on personality is 
evidence of the dynamic emphasis. Instead of focusing attention exclusively 
on the child’s overt behavior, the observer is concerned about the springs 
of behavior. According to Wolff (97), he should be guided by two major 
concepts: (a) that a child’s behavior is an expression of his “search for 
self,” and (b) that the child’s view of the world is quite different from 
the adult’s. 

Since an individual’s personality is often reflected in the behavior and 
attitudes of other persons toward him and his response to them, the study 
of interaction in groups and in different environments and cultures is an 
important approach to the study of individuals. 


Technics for the Study of Personality 


Because the description of personality is determined so largely by the 
instruments used, technics for research on personality are extremely 
important. For example, a concept of personality as a mosaic of separate 
traits and behavior is built by personality inventories and other instru- 
ments that collect only isolated details, whereas quite a different descrip- 
tion of personality is obtained thru technics that reveal the inner world 
and motives of persons. Each of the major technics will be briefly reviewed. 


New Uses of Psychometric Tests 


Psychometric tests are being increasingly used in a more flexible way. 
The research worker sometimes modifies the directions of standardized 
tests in order to study aspects of personality not revealed by test scores. 
For example, by adapting the administration of standardized tests to poorly 
adjusted individuals more valid IQ ratings may be obtained than when 
the directions for the test are rigidly followed (47). 

Qualitative analysis of intelligence test responses has been used to differ- 
entiate between mental defectives and normal pupils. Cruickshank (20) 
observed the reactions of children in these two groups to the most difficult 
arithmetic problem in the Binet test. The mental defectives were character- 
ized by their lack of autocriticism, unwillingness to admit inability to solve 
the problem, and blind manipulation of the numbers. In another experiment
(88) the defective individuals were found to give superior responses on
eleven items and inferior on eighteen items of the Binet test.

Other verbal tasks have been used to study disorders of conceptual
thinking (71). Only the normal adults were able to function at an abstract
verbal level, and this ability differentiated them from children and from 
paretic and schizophrenic patients. Other evidence that psychometric tests 
can be used to study emotional adjustment was obtained by Despert and 
Pierce (25) in their study of the relationship between test results and the 
total records of preschool children. Psychometric tests are being increas- 
ingly employed to differentiate between normal and abnormal persons. 
Rabin (69) summarized research on the use of the Wechsler-Bellevue Scale 
for this purpose. 

An example of the diagnostic use of tests designed for another purpose 
was described by Klein (51). The procedure was for each cadet to estimate 
his performance before and after each of the six psychomotor tasks. This 
modified level-of-aspiration technic indicated that “cadets who overestimate 
their performance are more likely to fail in flying training than those who 
underestimate their performance.” The further extension of this type of 
study of test results has very definite possibilities for research. 
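
The discrepancy scoring implied by this modified level-of-aspiration technic can be sketched as follows. This is only an illustration under stated assumptions, not Klein's actual procedure: the function name `estimation_bias`, the use of a simple mean difference between estimated and actual scores, and the sample figures are all hypothetical.

```python
from statistics import mean


def estimation_bias(estimates, actual_scores):
    """Mean of (estimate - actual) across tasks.

    A positive value means the subject tends to overestimate his
    performance; a negative value means he tends to underestimate it.
    This scoring rule is an assumption made for illustration.
    """
    if len(estimates) != len(actual_scores):
        raise ValueError("need exactly one estimate per task score")
    return mean(e - a for e, a in zip(estimates, actual_scores))


# Hypothetical cadet on six psychomotor tasks (invented numbers).
estimates = [60, 55, 70, 65, 50, 60]
actual = [50, 50, 60, 60, 45, 55]

bias = estimation_bias(estimates, actual)
label = "overestimates" if bias > 0 else "does not overestimate"
print(label, round(bias, 2))
```

A score of this kind could then be compared between cadets who passed and those who failed flying training; the comparison itself is outside the scope of this sketch.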

Another development in the use of psychometric tests for studying indi- 
viduals is the recognition of individual differences in the suitability of 
tests for different subjects. Vernon (91) found verbal intelligence tests 
very useful for testing high-grade persons such as officer candidates, but 
far less satisfactory for dull and poorly educated men. 


Personality Inventories and Questionnaires 


The rise and fall of personality inventories is indicated by the few 
references to the Bernreuter and the Bell inventories and many references 
to the Minnesota Multiphasic Personality Inventory (12, 56). Anyone 
planning to use personality questionnaires in a research study should read 
carefully Ellis’ review (29) which presents evidence of the low validity of 
these instruments. 

However, personality inventories have not been discarded for research 
purposes. Instead, they have been used for pattern analysis, i.e., the study 
of patterns of responses and personality profiles derived from them. For 
example, Gough (40) described a method of finding diagnostic patterns 
on the Minnesota Multiphasic Personality Inventory, and Meehl (58) has 
done significant work on profile analysis. The two main methods of inter- 
preting profiles are the statistical and the intuitive, the latter being at 
present superior to the former. As regards the former, Du Mas (27) found 
the Chi-square method preferable when time is not a factor. 

An attempt was made by Winthrop (96) to study personality integration 
by means of a paper-and-pencil test of attitude consistency. He found 
marked differences among college students in the consistency with which 
they responded to contradictory statements of attitudes. This inconsistency 
he attributed to “semantic blockage” and lack of a sense of logic and 
ability to see relationships. 

December 1948                                    STUDIES OF INDIVIDUALS

The question as to whether signatures on questionnaires of a personal
nature have a substantial influence on their validity has been somewhat
differently answered by various investigators. Altho the effect of signing 
seems slight in the majority of cases, it may be important when serious 
personal problems are being studied (22, 35). 

Jackson (48) compared paper-and-pencil tests, interviews, and ratings 
with respect to their effectiveness for evaluating personality. He found that 
paper-and-pencil tests and interviews were influenced least by “extraneous 
factors” such as school achievement and intelligence. Ratings were con- 
siderably influenced by these factors, more so in the case of teachers’ 
ratings than in the case of ratings by parents. 


Observation 


As a method of studying personality, observation has become increas- 
ingly important (15). Howie (45) described how observation could be 
used more effectively in the classroom by selecting traits that can be 
observed under school conditions, knowing the subjects for at least a year, 
systematically focusing on one trait at a time, and making the comparison 
of pupils more explicit by sorting procedures. Newman (65) described a 
procedure for observing adolescents in informal groups, using a “com- 
posite” scale of behavior patterns and other forms of behavior-rating scales. 





Sociometric Technics 


During the last three years much has been learned about relationships 
of individuals in groups by means of the sociometric technic. Sociometry 
is the method most frequently used in research on group dynamics and the 
interaction within school and community groups. 

The newer trend in sociometric study of individuals is to describe the 
personality of persons having different sociometric status, or “choice 
positions.” Northway (66) reviewed the Toronto studies on this subject. 
Bonney (11) reported factors related to mutual friendships. Other investi- 
gators (54) have found that ninth-grade pupils who on the sociometric 
test were least accepted by their classmates, mentioned more personal 
problems on the Mooney Problems Check List. 
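
The notion of sociometric status, or "choice position," and of mutual friendships can be illustrated by a simple tally of choices received. The sketch below is an illustration under stated assumptions rather than the procedure of any study cited here: the pupil names, the choice data, and the helper functions `choice_status` and `mutual_pairs` are hypothetical.

```python
from collections import Counter


def choice_status(choices):
    """Count how many times each member is chosen by others.

    `choices` maps each chooser to the list of members he or she names.
    Self-choices are ignored; choosers whom nobody names get a count of 0.
    """
    received = Counter({member: 0 for member in choices})
    for chooser, chosen in choices.items():
        for member in set(chosen):
            if member != chooser:
                received[member] += 1
    return dict(received)


def mutual_pairs(choices):
    """Return the set of pairs in which each member chose the other."""
    pairs = set()
    for a, chosen in choices.items():
        for b in chosen:
            if a != b and a in choices.get(b, []):
                pairs.add(frozenset((a, b)))
    return pairs


# Hypothetical class of four pupils, each naming two preferred classmates.
choices = {
    "Ann": ["Bob", "Cal"],
    "Bob": ["Ann", "Cal"],
    "Cal": ["Ann", "Bob"],
    "Dee": ["Ann", "Bob"],
}

status = choice_status(choices)
least_accepted = min(status, key=status.get)
print(status, least_accepted, len(mutual_pairs(choices)))
```

In this invented example the pupil with the fewest choices received would be the one whose personality or problem check-list responses the investigator examines more closely.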


Personal Documents 


Not many significant developments in the use of personal documents for 
research purposes have been reported during this three-year period. Combs 
(19) analyzed the autobiographies and T.A.T. protocols of forty-six univer- 
sity students and found much overlap in the material obtained by the two 
instruments.

Many valuable insights may be obtained from personal documents such 
as “Letters from Jenny” (50) and “Children’s Autobiographies” (68). 

Expressive and Projective Technics


The Rorschach method and other projective technics have been thoroly 
reviewed by Hertz, Ellis, and Symonds in the February 1947 issue of the 
Review and by Anderson in the April 1948 issue. Accordingly, references 
relating to research on the technics themselves will be omitted here and this 
chapter limited to trends in the use of projective technics in the study of 
individuals. 

There has been an enormous interest in projective technics to the neglect 
of certain other methods of studying the individual. Projective test re- 
sponses may be studied in a number of different ways. Clauses, sentences, 
paragraphs, or episodes may be analyzed and coded as in the standard 
procedure for the Rorschach and the T.A.T.; or a set of categories may 
be applied to the responses to a card or picture as a whole (37). An insight- 
ful synthesis may be made by clinical study of the total responses. 

Projective methods are being applied to everyday situations in which the 
person can make a relatively free, extended, and personal response. (This 
free response is carefully recorded. It may be elaborated by searching ques- 
tions as in “the inquiry,” and projective methods of interpretation are then 
applied.) An example of this wider application of projective methods is 
Sims’s description of “the essay examination as a projective technic” (81) 
and Munroe’s “use of projective methods in group testing” (60). 

Projective methods are being used increasingly with adolescents and 
adults. An example of this trend is the application of an adapted set of the 
Lowenfeld “Little World” material, usually used with children, to 100
adults (10). In response to the invitation to do as they pleased with the 
materials, the majority of the adults represented everyday life as it appeared 
to them. An analysis of their responses gave personality pictures of the 
individuals and an understanding of certain aspects of life in our culture. 
Correspondence with personality patterns of the same subjects obtained 
from biographies was high. 

Projective methods of administration and interpretation are being in- 
creasingly applied to standardized psychological tests. All psychological 
tests are projective insofar as they reveal the individual’s personality struc- 
ture and inner world. Various types of personality disorders are detected 
by studying, qualitatively as well as quantitatively, the successes and fail- 
ures, the scatter pattern, and scores on subtests of an intelligence scale 
or other standardized test. For example, the Wechsler-Bellevue test has 
been used as a “nonprojective personality test” to reveal impairment of 
mental functions and its effect on personality (80). The use of psychological 
tests for this purpose is based on the assumption that mental illness mani- 
fests itself in a person’s thought processes, as expressed in the test per- 
formance (70). 

Among the developments in methods of studying individuals by means 
of the Rorschach are the graphic method proposed by Hilden (44), the 
inspection technic and use of diagnostic signs developed by Munroe (59) 
and by Buhler, Buhler, and Lefever (14), the multiple-choice group
Rorschach which has not been generally satisfactory even as a screening 
device, and the supplementing of Rorschach responses by free associations 
that further aid in the qualitative interpretation (49). 

During the three-year period covered by this review, the second volume 
of Beck’s manual for the Rorschach test (6) has been published. 

More has been learned about the applicability of the Rorschach to chil- 
dren as young as three years (36), the personality patterns of old persons 
(52), the study of personality in preliterate cultures (55), and the per- 
sonality factors in reading disability (93). 

During the war, quick technics of measurement and statistical treatment 
of data were developed. Rorschach Standardization Studies (14) are in this 
direction. They aim to standardize a list of diagnostic Rorschach signs 
that would simplify scoring and interpretation. Some of these short-cuts 
can be used in civilian measurement. They should be critically examined, 
however, to be sure that they do not lead to superficial or erroneous 
descriptions of personality. Nor should these short methods be substituted 
for a much more significant analysis and integration of subjects’ total 
responses. 

Hertz, Ellis, and Symonds, in a chapter on the Rorschach method and 
other projective technics in the February 1947 Review, pointed out three 
dangerous trends: (a) oversimplification of administration, scoring, and 
interpretation; (b) modification of the method to allow use by untrained 
persons; and (c) use of group technics before adequate study of their 
effectiveness. 

In a comparative study of motivations as revealed in Thematic Apper- 
ception stories and autobiography, Combs (19) found considerable agree- 
ment with respect to underlying drives indicated by the two methods. The 
T.A.T., however, revealed more desires relative to the past and future, and 
also more socially undesirable desires. 

The projective technics, being based on material to which the subject 
has no ready-made responses, are less dependent on the culture than are 
tests having a large element of achievement. Accordingly, they are especially 
valuable for studying personality in different cultures. Henry (43) de- 
scribed the use of the T.A.T. with about one thousand Indian children in 
the study of culture-personality relations. 

In giving the T.A.T. to six mental patients and a close relative of each, 
Rosenzweig and Isham (77) obtained complementary material that sup- 
ported and extended the case history data and that emphasized areas of 
conflict between the patients and their intimate associates. 

Paintings and drawings, analyzed as to content and form (line, color, 
use of space, organization) have revealed personality characteristics similar 
to those made thru a study of the same children at home and at school (1). 
The drawings may be free and spontaneous or made in response to certain
directions. In one experiment (95) the personality sketches made on the 
basis of the drawings were recognized by the students’ teachers in 103 out 
of 116 instances. In finger-painting, interpretation properly made gives
clues as to the personality dynamics of the subjects and shows differences
in age, sex, and background. The correspondence between observation and
clinical diagnosis is reported to be high (62, 63). 

A different use of drawing is to ask subjects to draw what they felt were 
the most important events of their lives. Here drawing was used as a means 
of communication in the same way that verbal responses to a given stimulus 
might be used (30). 

Play technics, while used extensively for therapeutic purposes, also help 
to diagnose personality and show the dynamics of child behavior (3, 4). 
Differences between psychotic delinquent children and normally adjusted 
children with respect to relationships of children to father were revealed 
in a standardized projective doll play (5). 

The Rosenzweig picture-association is a method of studying the reactions 
of children and adults to frustration (75, 76). 

The incomplete sentence has been used for various purposes: as a screen- 
ing device and as a method of measuring improvement in adjustment. It 
seems to reveal anxieties and hostilities better than ratings and reports 
(74, 78, 86, 87). It yields information on (a) conflict or unhealthy re- 
sponses, (b) positive or healthy responses, and (c) neutral responses. 
Goodenough (39) presented the free association test as a technic of great 
value in giving “signs” that indicate personality structure. 


Interview 


As the interview has become more nondirective, its value for studying 
personality has increased. Sound-recorded interviews, in which the person 
tries to understand himself, supply important data for research on per- 
sonality (21, 73). Thus far, however, the interview has been used far more 
extensively for counseling and psychotherapy than for research. 


Case Study 


In their evaluation of the case study as a research method Symonds and 
Ellis stated in the Review for December 1945 that “the case study has been 
of increasing value to students of research in education, psychology, sociol- 
ogy, and anthropology; . . . progress has been made in the technics of 
gathering and treating case-study data for research purposes; and . . .
case material has been employed in many significant investigations” (p. 
352). 

The case study, like the interview, has been used primarily for service 
rather than for research purposes. The bulk of published material consists 
of case histories and studies (16, 82). These case studies give a “realistic 
synthesis” of individuals. If more systematically collected, case studies can 
be used for research purposes. They have been used to study certain 
relationships such as that between symptoms of maladjustment and back- 
ground factors (83) and differences between unselected students and
students with emotional problems (57). The difficulty of interpreting a 
case study was highlighted by Elkin (28) who found wide disagreement 
on the part of thirty-nine persons who attempted to interpret the motives 
and adequacy of adjustment of the same case, and by Davis (23) who 
compared the value of the case study with data from mental tests for 
different purposes. He found the case study more important in clinical 
diagnosis and the tests in vocational guidance. 

As in other technics, attempts have been made to make a quantitative 
analysis of case records as the basis for generalizations about clients (9). 
This approach is in opposition to the view of individual cases as unique. 


New Methods of Diagnosing Personality 


Many new methods, mostly of the projective type, for diagnosing different
aspects of personality have been proposed but not sufficiently validated.
Among these are (a) a new perceptual test (2) in which simple letter 
combinations are exposed for 0.5 seconds to determine certain features 
of personality such as speed, accuracy, consistency, cautiousness or
venturesomeness, level of aspiration, and emotional disturbances of various
types; (b) a picture interpretation personality test, similar in theory to
the T.A.T. but dealing with different subjects such as the child in his
various relationships (13); (c) a verbal projective technic in which the
subject responds
to the question: “tell me three things that are impossible” (26); (d) a 
modification of the Sargent test, which consists of briefly described situa- 
tions emphasizing the major conflict areas of the personality and followed 
by questions such as: What did he (she) do? Why? How did he (she) 
feel? (34); (e) the study of dreams as projective documents (42). 

Individuals may be studied thru their responses to annoyances. Bennett 
(7) found differences between neurotics and those having no record of 
neurotic behavior with respect to their specific sensitivity to noise and 
their general sensitivity to stimuli reminding them of their personal in- 
adequacy. The phenomenon of “autokinetic movement” has been shown 
to differentiate between various kinds of mental patients (94). Different 
individuals perceive, with high reliability, various patterns of movement
when they look at a stationary pinpoint of light in an otherwise totally
dark room.


New Combinations of Methods 


Research using a combination of methods or a battery of tests is
increasing. Many new developments are reported by the U. S.
Office of Strategic Services in its significant report, Assessment of Men
(90). New combinations of methods are being tried, as, for example, a
T.A.T. type of test with the sociodrama (33) in which the feelings de- 
scribed in the picture story are acted out in the role-playing situation, 
thus fusing diagnosis with therapy. Some of these methods have been
combined to study the relationships among physical, mental, and motor
characteristics, as, for example, the use of the sentence-completion test
with psychogalvanic response (17). 

In a camp situation a combination of observation, paper-and-pencil tests, 
and projective technics gave insight into the personality of individuals 
(31). The counselor’s observation and his judgment recorded on the 
Canter questionnaire showed how the campers behaved in the group; the 
Rogers Test of Personal Adjustment gave information about their wishes 
and likes; the Rorschach and analyses of paintings gave insight into their 
personality structure. 


Diagnostic Study of School Subjects 


Diagnostic methods in school subjects have been summarized in the 
December 1945, April 1946, and February 1947 issues of the Review; in 
yearbooks of the National Society for the Study of Education (64), and 
in Russell’s short review on reading disabilities and mental health (79). 
The most extensive research in the field of reading is Robinson’s report 
of her clinical study of thirty children (72). The analytical approach was 
represented by the description and classification of the major factors, and 
the synthetic approach by the reports of cases in which all the data 
obtained by a group of specialists were brought together. By discussing 
the diagnostic data on each child in a case conference, the most probable 
causes of the reading problems became clearer and a program of treatment 
was suggested. The major causes of failure in this study were found to 
be poor family interrelationships; visual anomalies; emotional problems; 
inadequate methods of teaching of reading in school; neurological, endo- 
crine, auditory-acuity, and other physical difficulties; and speech and 
functional auditory factors. 

In Gray’s “Summary of Reading Investigations” (41) few new methods 
of studying reading difficulties and development were reported. 

An indirect method of ascertaining reading maturity was described by 
Husband (46). He found that a preference for the precise, concentrated 
passages as against the loose, ambiguous selections in both poetry and 
prose was associated with high intelligence. From a battery of tests certain 
patterns of relationships seem to be characteristic of retarded readers, as, 
for example, with respect to associative learning and memory-span test 
findings (85). 

The use of instructional tests is an important means of evaluating 
growth in critical thinking, ability to get the author’s pattern of thought, 
and other broader phases of the language arts not usually included in 
standardized tests (67). 

Very little has been done in self-diagnosis and the client-centered 
approach in dealing with difficulties in school subjects. The cases reported 
have had the service rather than research emphasis. 

An important practical trend is the application of diagnostic methods
to everyday school tasks, i.e., diagnosis while teaching and learning.
Closely allied to this trend to make diagnosis an intrinsic part of
instruction is the use of informal instructional tests to collect valuable
information about how children learn specific kinds of subjectmatter.


New Developments and Trends 


Two main trends are discernible in the use of these technics for research 
purposes. One is the trend toward making the collection, interpretation, 
and treatment of the data more analytical and objective, more like the 
quantitative scores on standardized psychological tests. This tendency is 
shown by measurement of individuals’ characteristics by means of more 
numerous and specific objective tests (24). The same trend is shown in 
the attempts to objectify, analyze, categorize, and quantify case data. It 
is also shown in the efforts to devise definite scoring methods for the 
Rorschach, T.A.T., and other projective technics. 

A parallel trend is the insightful synthesis of comprehensive data— 
responses in projective technics, case study, and various types of tests and 
inventories. For example, the Office of Strategic Services (90) reported 
the successful use of observation of persons in a variety of situations 
unfamiliar to them. On the basis of three days’ observation of each person, 
trained observers obtained an understanding of the dynamics of personality 
that stood the test of success on the job (90). 

During the war, tests were used lavishly in appraising individuals’ 
qualifications for certain jobs. This war need has paved the way for the 
study of personality by means of batteries of highly-specialized tests. 
Research in this direction is represented by the Thurstone Tests of Primary 
Mental Abilities (89) and the Differential Aptitude Tests by Bennett, 
Seashore, and Wesman (8). 

Statistical methods of treating personal data obtained on Navy and 
Army personnel during the war have significant implications for present 
research, especially in the field of vocational selection (92). 

Research might be facilitated by quicker methods of administering and 
scoring. Along this line are the attempts to adapt the Rorschach, the T.A.T., 
the Minnesota multiphasic inventory, and other instruments to group 
methods of administration and multiple-choice or other objective-type 
responses, and to quick methods of scoring. These efforts, however, have 
not yet been successful, and may be in the wrong direction, away from 
clinical interpretation of whole responses. 

Perhaps the most important future development lies in the use of tech- 
nics such as the projective technics, the sociodrama, and the nondirective 
interview—which have previously been used primarily for service pur- 
poses—for research in personality. 


CHAPTER III


Evaluation, Trend, and Survey Studies 


J. WAYNE WRIGHTSTONE, FRED P. FRUTCHEY, and IRVING ROBBINS 


This chapter presents brief reviews of selected evaluation, trend, and
survey studies for the period July 1945 to June 1948. Most of the studies
represent applications in the more usual educational situations without the 
benefit of financial support from private foundations or agencies. The 
emphasis in the various appraisal technics reveals a growing interest in 
personal and social characteristics as well as the more intellectual factors. 
Some slight increase in follow-up types of studies is apparent. 


Formulation and Definition of Objectives for Evaluation 


An outstanding contribution to the field of evaluation, The Measurement
of Understanding (67), contains lists of objectives stated in terms of
behavior and specific illustrative evaluation technics for the fields of social
studies, science, mathematics, language arts, fine arts, health and physical
education, home economics, agriculture, technical education, and industrial
arts. Harris (44) discussed various problems in school appraisal, and 
Wrightstone (99) stressed evaluating primary school achievement in terms 
of multiple objectives thru the use of standardized and teacher-made tests 
and technics in areas like readiness, attitudes, interests, creative expression, 
and personal-social adaptability. 

At the close of World War II, the Extension Services of the Land Grant 
Colleges and Universities, the United States Department of Agriculture 
cooperating, completed thirty-one years of existence as an educational 
agency to rural families. A committee (94) was appointed to review its 
work and define the scope of the Extension Service educational responsi- 
bilities as they can be foreseen for the coming years. Likewise a 4-H Club 
committee (93) reviewed the objectives of 4-H Club work for boys and 
girls, and reformulated these objectives as ten guideposts for 4-H Club 
work. 

At the college level, studies of objectives and their evaluation were 
conducted. Dunkel (25) formulated a list of twenty main goals from 
papers written by students and teachers. Patterns of goals for individuals 
were obtained by having each student rank all goals by the method of 
paired comparisons. Marsh (62) asked a group of college students who 
had just completed their first course in psychology to indicate how 
important fifteen stated objectives were for them and how well they 
thought these objectives had been attained in the course. Only a fair 
degree of relationship existed between how important an objective was 
judged to be and how well it was attained in the course. 
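
The ranking of goals by paired comparisons mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Dunkel's actual procedure: the goal names, the `importance` scores, and the `prefer` function are hypothetical stand-ins for a student's judgments, and goals are ranked simply by the number of pairwise comparisons each one wins.

```python
from itertools import combinations


def rank_by_paired_comparisons(goals, prefer):
    """Rank goals by how many pairwise comparisons each one wins.

    `prefer(a, b)` must return whichever of the two goals the judge prefers.
    """
    wins = {goal: 0 for goal in goals}
    # Present every unordered pair of goals exactly once.
    for a, b in combinations(goals, 2):
        wins[prefer(a, b)] += 1
    # Most wins first; the stable sort keeps the original order among ties.
    ranking = sorted(goals, key=lambda goal: -wins[goal])
    return ranking, wins


# Hypothetical goals and a hypothetical student's private importance scores.
goals = [
    "vocational preparation",
    "social skills",
    "general knowledge",
    "self-understanding",
]
importance = {
    "vocational preparation": 3,
    "social skills": 1,
    "general knowledge": 2,
    "self-understanding": 4,
}


def prefer(a, b):
    # Stand-in for the student's choice between two goals.
    return a if importance[a] > importance[b] else b


ranking, wins = rank_by_paired_comparisons(goals, prefer)
print(ranking)
```

With n goals the method requires n(n - 1)/2 judgments, so a list of twenty goals would call for 190 comparisons from each student.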


Construction and Refinement of Evaluation Technics


Greene, Findley, Couey, and Stanton (40) reported training the instruc- 
tors at Air University, Maxwell Field, Alabama, in construction of evalua- 
tion instruments. The instructional staff cooperated with test specialists in 
planning, constructing, editing, analyzing, and interpreting the findings 
from tests, rating scales, and other evaluative instruments. Dexter (22) 
constructed a questionnaire for students’ evaluation of a course of study. 
Sartain (80) devised a scale to measure student attitude toward the
difficulty of a college course and applied the scale to various psychology
courses to show differences between value, interest, and difficulty of these 
courses. 

In studying the relative effectiveness of paper-and-pencil tests, inter- 
views, and ratings for personality evaluation, using the California Test of 
Personality and the Woody Student Inquiry Blank as criteria, Jackson 
(52) concluded that each of the technics has differential values for pre- 
dicting personality traits. Raths (72) discussed social accomplishment 
and the process of test construction as factors making for the validity of 
the Social Acceptance Test. Raths and Metcalf (73) reported a reliability 
of over .90 for The Wishing Well, an instrument for identifying needs of 
elementary school children. In a comparison of a written test on
superstitions with the (unobserved) behavior of a sample of fifty-one
ninth-grade pupils, Zapf (100) obtained a correlation of .79 ± .03. A five-point
scale check list of 200 items based on the ten imperative needs of youth 
as gathered by a committee of the National Association of Secondary- 
School Principals was presented for further refinement by French and 
Ransom (34). 


Evaluation Studies 


Elementary and Secondary School Levels. An increasing interest in 
sociometric studies is reflected in the literature. Blanchard (5) summarized 
eighteen studies made between 1902 and 1946. He concluded that the studies 
were mainly at the elementary level; the technic most frequently employed 
was the questionnaire; and the factors which influenced choice of friends 
were work groups, play groups, social groups, chronological age, mental 
age and IQ, home background, personality traits, development, and social 
adaptability. Austin and Thompson (3) studied children’s friendship pat- 
terns and found that personality characteristics, geographical propinquity, 
and similarity of interests accounted for status of and changes in friend- 
ships. An analysis of social relationships in a war-boom community by 
Morgan (65) thru the use of questionnaire and a sociometric test revealed 
the importance of father’s income and level of school achievement. Cook’s 
experimental sociographic study (13) showed how the teacher, familiar 
with classroom friendship structures, can modify and change such
structures in the direction of increased group interaction. Horns and Watson
(50) showed that for fourth- to sixth-grade children in an upper-class 
private school, gentile children were more clannish than Jewish children.
An interracial study of social acceptance by Raths and Schweichart (74) 
showed a generally higher acceptance of boys and girls along color lines 
(white and Negro), altho many white boys and girls showed a high 
acceptance of colored members of their own sex. 

Gates (36) used a teacher rating scale, employer ratings, disciplinary 
records, participation in civic enterprises, participation in school, com- 
munity and church activities, teacher observation, and peer ratings in an 
evaluation of civic competence of 489 high-school seniors. 

Olson (69) reported an intensive study of a third grade in the Univer- 
sity of Michigan laboratory school. Pupils were studied in an attempt to 
improve the quality of social relations in the group. Teachers, a pedia- 
trician, a psychometrician, and a biologist collaborated to analyze the 
psychobiological characteristics of the group. Family and community 
relations were studied. Sociometric tests were made at the beginning of 
the study and again after six months of recommended treatment of the 
group involving changes in teacher technics and suggestions resulting from 
parent conferences. The follow-up study six months later gave the teachers’ 
subjective estimate of improved social relations, but the sociometric tests 
failed to reveal significant changes. The conclusion was that children’s 
social relations in a classroom have deep roots in community and family 
living as well as in the physical, mental, and emotional differences among 
the children. 

Gilbert and Wrightstone reported (37) on the first systematic evalua- 
tion of camping as part of a public elementary-school program. Using 
matched groups of fifth- and seventh-grade pupils and a battery of tests, 
the investigators concluded that camping was a valuable aid in promoting 
democratic education. The authors recommended increased teacher educa- 
tion along recreational lines and an extension of camping experiences and 
their evaluation. 

Rothney and Hansen (77) evaluated a broadcast series on the Wisconsin 
School of the Air. They found, primarily by questionnaire, that pupils 
liked the series, identified the program with their own local towns, and 
showed no evidence of bias in choosing favorite characters in the program.
In a sub-study of the intercultural attitudes of a control (nonprogram 
listeners) and experimental group (program listeners), some statistically 
significant differences in favor of the experimental group, not necessarily 
attributed to the influence of the radio series, were found. 

College Level. Various studies have compared achievement of veterans 
and nonveterans in college. Epler (27), Clark (9), Thompson and Flesher 
(86), Tibbitts and Hunter (87), Stewart and Davis (84) found generally 
that veterans make as good or better grades than nonveterans. Clark (9) 
interpreted the data as indicating that veterans were better motivated and 
worked harder to succeed in their classes than nonveterans. Flesher (29) 
made an intensive study of seventy-six women who obtained their under- 
graduate degrees in three years or less. Each accelerate was paired with 
another student of the same age and ability at entrance. The accelerates
had a superior average grade but were not superior in extracurriculum 
activities. Assum and Levy (2) compared the academic ability and achieve- 
ment of two equated groups of seventy-one students, one group of which 
had applied at the Counseling Center of the University of Chicago for 
help on personal problems. On the college comprehensive examination, 
the less well-adjusted group gave evidence of poor academic achievement 
but had equal achievement on the college reading and writing ability tests. 
Toven (89) compared a counseling program at the college level for a 
group of 188 students who were systematically counseled for four years 
and a control group of 188 students who were not counseled. The study 
indicated that the counseled students had fewer scholastic difficulties, 
completed more point credits, and better realized their aims in attending 
college. Hewitt (46) reported guidance thru self-appraisal in a program 
in which use was made of (a) technical information and personality in- 
ventory tests taken by the students; (b) reports from teachers; (c) student 
autobiography; and (d) the counselor’s summary. Griffiths (41) studied 
the relationship between scholastic achievement and personality adjustment 
of men college students. He concluded that men with brilliant scholastic 
records are not better emotionally adjusted than those with lower academic 
achievement and that unsatisfactory personality scores are not significantly 
correlated with unsatisfactory grades. Kilby (56) reported that students 
who received remedial reading instruction in college earned significantly 
higher final grade averages than did those in an untrained control group. 
Bloom (6) studied the implications of problem-solving difficulties for in- 
struction and remediation in the College of the University of Chicago. 
Poorer students showed greater difficulty in understanding the nature of 
problems, probably as a result of reading disabilities. They exhibited
attitudes of a lack of confidence and were often unable to relate a problem
to information already possessed. Attempts to change problem-solving
methods on an experimental basis have been successful and have resulted 
in improved examination marks. Wells (97) reported the psychometric 
work of the Grant Study with well-adjusted Harvard undergraduates. The 
data from psychological tests were integrated with data from general 
medicine, physiology, anthropometry, and psychiatry to evaluate adjust- 
ment to higher education. Reed (75) compared two colleges, K and M, 
on the Michigan Vocabulary Profile Test. Only 8 percent of the K students 
equaled or exceeded the median for M students. The study showed that M
students, who had a higher average intelligence test score, had read more
books and more and better magazines.

Extension Education. In extension education, which is informal out-of- 
school education, evaluation studies have been conducted to obtain research 
findings on organizing rural adults and young people for informal
teaching; their needs, interests, and social situations; planning educational
programs; effectiveness of methods of teaching; the use of volunteer 
leadership; and measurements of results in terms of the objectives of 
teaching. The conduct of these studies is based on the proposition that
objective data can be obtained on which to formulate administrative policy 
and modify teaching procedures. 

Studies of the use of radio as a teaching method have received con- 
siderable attention in recent years. Crile (15) studied the extension radio 
program in Ward County, North Dakota. She found that farm families 
were reached over the radio who were not otherwise participating in 
extension programs. Listeners acted upon the information they received 
about farm and home practices. In Wisconsin three combinations of
extension teaching methods were evaluated: radio and leaflet; radio, local
leader, and leaflet; and local leader and leaflet (20). Radio instruction
was used effectively in Massachusetts to reinforce the teaching of home- 
makers by local leaders on making coats, suits, dresses, and other garments 
(71). In New York State also, the effectiveness of the radio in teaching 
a technical subject (dress making) thru a long series of lessons was tested 
(83). The listeners adopted many new sewing practices. In Minnesota, 
farm people prefer the interview type of radio program in extension 
education, according to Hanson’s (43) study of listening habits. Most 
farm families have radios. The extension radio programs reach farm 
families not otherwise participating in extension activities. The findings 
of studies of farm and home radio programs were summarized by Crile 
(21) for administrative and program-planning use. A number of other 
studies of extension teaching by radio are now in progress. 

Printed materials are widely used as a method in extension
teaching and have been receiving attention in extension studies. Thru
personal interviews with 216 farm families, Arbour and Mason (1) found 
that a monthly guide of things to do on the farm and in the home in 
Louisiana was used by those who received it. The study revealed the need
for improving the readability of Extension Service publications.
Clark and Mason (10) found that a monthly leaflet on homemaking and 
extension news distributed to rural homemakers in Connecticut was widely 
read with more than half using the information in the leaflet. In a study 
of bulletin readership in New York, Ward (95) found the bulletins widely 
used, read, and accepted but found that more emphasis needed to be 
placed on pictures, simpler presentation, and shorter publications. 

Burleson (7) in Louisiana analyzed 351 of the mimeographed letters 
and announcements which county extension agents send to farmers and 
homemakers on timely farm and home practices. He found the language 
too difficult and concluded that the agents rate higher in their knowledge 
of subjectmatter than in their skill to impart it thru letters. County exten- 
sion agents in Kansas furnish local editors with nearly half of the farm 
news appearing in local papers. Hilgendorf (47) found that the local 
editors preferred separate news stories rather than farm columns. They 
felt the people would read and get more out of farm news stories if they 
were adapted to the local situation. 

Cowing (14) has found thru the analysis of hundreds of written materials
for farmers and homemakers that the reading difficulty level of the
materials was too high for good comprehension. They are being simplified 
by using easier words, shorter sentences, and more personal references. 

The degree of organization for home demonstration teaching with rural 
homemakers varies considerably over the country and even within states. 
In Massachusetts, Billings and Collings (4) found that communities having 
an advisory council member and a community committee had more suc- 
cessful programs than communities having a less complete organization. 
In the former, local leaders tended to continue longer in service, were 
better acquainted with their functions, had a clearer understanding of the
scope of the Extension Service program, participated more in program 
planning and were more confident in their ability to make decisions on 
home problems. 

Some Extension Service studies were devoted to evaluating the results 
of extension teaching in a county or over a state. Frutchey and Wing (35),
thru a geographical sampling of over 200 farms, interviewed farmers and 
homemakers in Windham County, Connecticut. They found that more 
farmers with large enterprises adopted recommended practices than 
farmers whose farming activities were part-time or a sideline. They con- 
cluded that when the economic stake is high the adoption of recommended 
farming practices is higher. 

In Vermont an independent agency was asked to evaluate the work of 
the Extension Service thru personal interviews with a carefully selected 
sample of farm families. The results are given in two publications (90) 
(91). The study covered sources that farmers and homemakers use in 
getting information, farm and home practice changes, and attitudes toward 
extension teaching. 

A cross-section sample of 212 farm families was interviewed in Pontotoc 
County, Mississippi (64). It was found that the use of more than one 
teaching method increased the adoption of practices and that practices 
emphasized over a long period of time were more widely adopted. 

A factor in increasing 4-H Club membership for informal teaching of
boys and girls is holding the membership of the older members. A
cooperative study (53) in the New England states dealt with the charac- 
teristics of 4-H Club members thruout their high-school careers as well 
as with the characteristics of their clubs and local leaders. 

Another cooperative study (59) brought out factors influential in getting 
parents’ cooperation in 4-H Club work for their boys and girls. The re- 
searchers concluded that informed and invited people are interested people, 
and interested people are cooperative people. 


Trend Studies 


An excellent summary of trends in research, measurement, and evaluation
in the past fifty years, including extensive bibliographical references,
was written by Scates (81). Sabrosky (79) analyzed the annual reports
of the states for 1946 and prepared statistical summaries of 4-H Club
work on enrolment, number and size of clubs, age of members, length of 
membership, reenrolment, completion of the work, local leadership, and 
time county extension agents devote to 4-H Club boys and girls. Grandy 
(39) made a similar analysis for the State of Colorado covering the 
twenty-year period 1926-46. Sabrosky (78) also analyzed the annual 
reports of the states for 1946 and prepared statistical summaries for home 
demonstration teaching with rural homemakers. 


Follow-up Studies 


Franzén (32) summarized the responses to an opinionnaire of those 
who participated in the Cooperative Study of Secondary-School Standards. 
He found that revisions were needed especially in the sections dealing with 
outcomes, instructions, and teacher evaluation. An analysis (33) of the 
responses of 532 administrators whose schools were evaluated in the years 
1940-1947, showed the need for improving various procedures. 

At the college level, two follow-up studies have been reported. Knox 
(57) reported a sampling of eight Harvard graduating classes, distributed 
over the period 1880 to 1925. He found that graduation with honors was 
significantly related to prominence as determined by inclusion in Who’s 
Who in America. A combination of scholastic class and outstanding extra- 
curriculum achievement supplied the best basis for predicting future 
success. In a follow-up study Jones (54) compared certain measures of 
honesty at early adolescence with honesty in adulthood. A coefficient of 
contingency of .37 was obtained between honesty as measured in early 
adolescence and adulthood. These results suggest that honesty depends 
upon progressive organization of inner and overt behavior and that such 
organization is usually well under way in late childhood or early adoles- 
cence. 

Holcomb (49) made a follow-up study of 151 county extension agents 
on the job in Iowa. The agents found their subjectmatter training in college 
very helpful to them in their job of extension teaching but felt they had had 
insufficient preservice training in extension teaching methods, extension 
administration and organization, organization and teaching in 4-H Club 
work, adult education, program planning, office management and per- 
sonnel, technical journalism, and evaluation of extension programs. Prac- 
tically all of the agents favored induction training in these subjects for 
beginning workers. 


Surveys 


There were a variety of local, state, and national surveys in the litera- 
ture of this three-year period. House and Thompson (51) made a limited 
survey of the constitutional growth of a small sample of seventh-grade 
Appalachian children and found various deficiencies affecting child
behavior. Hawk (45) surveyed the speech needs of 1200 elementary-school
children and found seventy students who needed special attention.

Administrative surveys included a questionnaire study (28) made by 
the U. S. Office of Education on the growing professional status of the 
public-high-school principal. Woody (98) inventoried 635 Michigan school 
administrators on the current problems or issues which they were facing 
in the state. Otto (70) found practices in the use of achievement tests, 
guidance in program making, departmental instruction, promotion prac- 
tices, and library services in 286 public elementary schools and forty-six 
campus demonstration schools to be very similar. A questionnaire study 
(66) by Nannini showed that small town New York State school adminis- 
trators consider housing, recreation, and education the main community 
problems; labor, inflation, government controls, housing, and education 
the main national problems; and atomic energy, feeding starving people, 
and United Nations the main international problems. Duckworth (24)
studied the report forms of transfer students and found a serious lack of
uniformity in grading systems. Flesher surveyed the elementary-school
buildings (30) and secondary-school buildings (31) in twelve Ohio cities. He
found inadequacies for both levels, tho the secondary-school buildings were 
newer and relatively more adequate. Butterworth and Gragg (8) reviewed 
an extensive list of school surveys conducted from June 1943 to June 1946. 

Remmers and Davenport (76) presented data collected from the Purdue 
Opinion Poll for Young People showing how over 6000 high-school young- 
sters from thirteen states felt about such things as being a teacher, liking 
school, student government, work experiences, field trips, and teacher 
salaries. 

Haas (42) received 415 replies to a questionnaire mailed to the 461 
Wisconsin public secondary schools. He found that the schools allotted an 
“adequate” proportion of the curriculum to social studies, altho the titles 
of course contents followed traditional patterns. Also, he reported that a 
large percent of the small high-school social studies teachers were inade- 
quately prepared. 

McNelly (60) surveyed the county agricultural agent staff that was 
on the job in December 1947. Their average length of tenure was eight 
years. He also determined what they liked about their work and what they 
did not like. 

A subcommittee (85) made a survey of working conditions of extension 
personnel to determine a basis for making recommendations on the im- 
provement of working conditions and reducing turnover of personnel. 

Collings (12) made a comprehensive nationwide analysis of how the 
home demonstration agents use their time. One hundred seventy-seven
agents, selected proportionately over the country and at random in a region,
kept a daily record of their activities by five-minute periods for one week
in the winter and one week in the summer, making 2422 days of extension
activities for analysis. She found that the agents spent one-fourth of their
time teaching and that they worked an average of 51.5 hours a week 
including some Saturdays and Sundays.

Thru a survey of 666 farmers in Alabama, Leigh (58) found that the
average Alabama farmer reported using twenty-three ideas about improved 
farming practices which he received thru several means of communication. 
The number of ideas the farmers used increased consistently with the 
amount of education they had and the size of farm. Sixty-five percent 
of the farmers reported getting good farming ideas from children who 
brought them home from school. 

A nationwide radio survey (92) of attitudes of farm and small town 
people was made thru a stratified random sample of 2535 households in 
116 counties where 4293 interviews were made. Information was also 
obtained on program preferences. 

Social relationships studies have been made to determine the organiza- 
tions in a community and how they can function in channeling extension 
information to the people. Hoffer (48) found that the farmers of Eaton 
County, Michigan received useful information about farming from many 
sources. The neighborhood group and the rural school district were used 
very often to develop extension work. According to Miller and Beagle (63) 
school districts were a leading unity factor in social organization in Living- 
ston County, Michigan. The religious factor was next in importance. 
Niederfrank (68) studied the coordination of agencies in carrying on 
an effective educational program in health and nutrition in a parish in 
Louisiana. 

Studies were made to determine needs, interests, and abilities as a basis 
for program planning. Handicraft and rural art activities (19) (26) were 
determined thru a nationwide survey including Hawaii and Puerto Rico. 
Many studies of local volunteer leadership have been made in the past 
because of the importance of the local leader in conducting extension 
work. The findings of these studies were summarized by Crile (16) for 
administrative use. Crile also prepared a bibliography (18) of extension 
studies and a review (17) of extension studies for 1946-47 which is a 
valuable source. 


Frequency Studies 


A review of recent research in vocabulary development was prepared 
by Seegers (82) who gave a comprehensive summary of studies reported 
in recent years. In more specialized studies Cobb (11) compared the 
vocabularies of the Basic Vocabulary of Business Letters with the Gregg 
Shorthand Dictionary and pointed out the similarities and differences 
between them. Tiremen (88) conducted a study which he felt established 
definitely that a large percent of the error made by native Spanish-speaking 
children in recognizing English words in isolation was caused by failure 
to pronounce correctly the elements of the words. In a study of slang 
vocabulary in a large school, Kasser (55) analyzed changes in slang 
vocabulary thru comparison with the results of a similar survey conducted 
eleven years before. Most slang words originated with the high-school 
group and spread to younger children.

A vocabulary analysis of twenty pre-primers, published since 1937 and
representing eleven different series, showed that recent pre-primers have
a lower different-word count than earlier material and that there is an
increasing similarity of vocabulary in the different levels (23). Malter (61)
analyzed eight available studies dealing with children’s preferences for 
illustrative materials and found that children (a) prefer colored illustra- 
tions, (b) have a wide variety of interests in subjectmatter, (c) do not 
like silhouettes, and (d) have preferences subject to change. Applying 
the Dale-Chall formula for reading difficulty, Guckenheimer (38) found 
in an analysis of thirty-six pamphlets on international affairs that 75
percent of the material is at or above the college level in difficulty. A 
validity of .86, based on a correlation between the composite rating of 
seven judges and the formula, was reported. In an analysis of the distribu- 
tion of emphasis in ten physics tests and twelve physics textbooks, Weaver 
(96) found that the textual contents differed considerably and that the 
tests differed in emphasis of material and types of test items. 
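The Dale-Chall formula applied by Guckenheimer combines two counts taken from a sample of text. The sketch below uses the constants and grade bands of the formula as it is commonly published; the function names and the sample figures are hypothetical, and the percentage of words falling outside the Dale list is assumed to have been counted beforehand.

```python
def dale_chall_raw_score(pct_unfamiliar_words, avg_sentence_length):
    """Raw score from the percent of words not on the Dale list and the
    average sentence length in words (constants as commonly published)."""
    return 0.1579 * pct_unfamiliar_words + 0.0496 * avg_sentence_length + 3.6365


def grade_band(raw_score):
    """Translate a raw score into the corrected grade band."""
    bands = [
        (5.0, "4 and below"),
        (6.0, "5-6"),
        (7.0, "7-8"),
        (8.0, "9-10"),
        (9.0, "11-12"),
        (10.0, "13-15 (college)"),
    ]
    for upper_limit, label in bands:
        if raw_score < upper_limit:
            return label
    return "16 and above (college graduate)"


# Hypothetical pamphlet sample: 30 percent unfamiliar words,
# 22 words per sentence on the average.
score = dale_chall_raw_score(30.0, 22.0)
band = grade_band(score)
print(round(score, 2), band)
```

A pamphlet with these hypothetical counts would fall in the college band, which is the kind of result the study reports for most of the material examined.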


Needed Research 


The need for more rigorous research on the formulation and definition 
of major objectives of education is apparent. Present research is more 
opportunistic than systematic and comprehensive. The emphasis on in- 
formal, teacher-made evaluation technics represents a desirable trend, but 
research and development of more formal and comprehensive technics 
have progressed very slowly in recent years. Rationale and procedures 
in evaluative studies should be studied to establish more definitive criteria 
and principles. In surveys, trend studies, and frequency studies, a critical 
appraisal of procedures, criteria, and applications should be undertaken 
by one or more investigators to indicate improved designs and methods 
for these types of research studies.


CHAPTER IV


Research Methods and Designs 


JOSEPH LEV 


References included and reviewed in this chapter were selected because
they contain a discussion or illustration of some aspect of experimental 
design, or application of such design. 


Validity of Conclusions Based on Experiments 


An experiment in educational research commonly involves selection of 
a group of individuals and the application of some experimental procedure 
to these individuals, in order to obtain conclusions which are applicable 
to a population of which the subjects in the study are a sample. For such 
extension of reasoning or “induction” to be valid, it is necessary to elimi- 
nate the various possibilities of bias in the experiment. Much attention has 
been given in educational research to the elimination of bias in the ex- 
perimental procedure. In particular, care is taken to eliminate bias on the 
part of the experimenter by use of objective measures. Consideration is 
also given to making the experimental procedure fit the situation to which 
the experiment applies. Less attention has been given to the possibility of 
bias due to the selection of individuals who are the subjects of the ex- 
periment. Clearly, the subjects of the experiment must constitute a repre- 
sentative sample of the population, to which the conclusions of the experi- 
ment are to be generalized. Frequently, the logic of induction is accorded 
recognition only at the end of the experiment when tests of significance 
are applied but such tests are not valid if the sample is biased. Criticism 
of this aspect of experimental design was voiced by Walker (49), who 
noted that use has not been made of available knowledge of sampling theory. 

A detailed analysis of sampling bias in several psychological studies 
appeared in two papers by Marks. In the first of these papers, Marks (32) 
questioned the validity of generalization from two psychological studies 
because of sampling bias. It is sufficient to review his comments on one 
of these studies. The responses of a group of psychology students at
Dartmouth to a questionnaire on race prejudice showed that such prejudice 
was related to the education of their parents. Since the students were all 
from the same college, it is possible that selective factors operating in 
their admission to the college provided a biased sample in relation to 
the total population of America, or even to some more restricted popula- 
tion. By means of hypothetical data, Marks showed that selective choice 
from a population, in which the education of parents is unrelated to racial 
prejudice of children, might have resulted in a sample in which a marked 
relationship between the factors exists. Unless random selection is used, 
there can be no assurance of the elimination of bias in the sample. 
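The argument from hypothetical data can be sketched numerically. In the Python example below, all counts and admission rates are invented for illustration and are not Marks's figures. Parental education and prejudice are unrelated in the population, yet a selection rate that depends on both traits at once produces a marked relationship in the sample.

```python
import math


def phi(a, b, c, d):
    """Phi coefficient for the fourfold table [[a, b], [c, d]]."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom


CELLS = [("high_ed", "low_prej"), ("high_ed", "high_prej"),
         ("low_ed", "low_prej"), ("low_ed", "high_prej")]

# Hypothetical population: 1000 children in every cell, so parental
# education (rows) is unrelated to prejudice (columns).
population = {cell: 1000 for cell in CELLS}

# Hypothetical admission rates that depend on both traits together.
admit_rate = {("high_ed", "low_prej"): 0.4, ("high_ed", "high_prej"): 0.1,
              ("low_ed", "low_prej"): 0.1, ("low_ed", "high_prej"): 0.4}

# Expected composition of the selected (non-random) sample.
sample = {cell: round(population[cell] * admit_rate[cell]) for cell in CELLS}

phi_population = phi(*(population[cell] for cell in CELLS))
phi_sample = phi(*(sample[cell] for cell in CELLS))
print(phi_population, phi_sample)
```

Random selection would apply the same rate to every cell and so would leave the population relationship, here zero, undisturbed apart from chance variation.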

In a second study Marks (33) reviewed the sampling used in the revision
of the Stanford-Binet scale, which was of the type known as cluster
sampling, since subjects were obtained in groups from a number of com- 
munities. Marks pointed out the overemphasis given to a few communities 
in the sampling. Because of the similarity among individuals within a 
group, there is a possibility of intraclass correlation which influences 
statistical calculations. Marks suggested that greater validity might have 
been achieved by sampling more communities with fewer cases in each 
community. He also gave formulas appropriate for calculation of standard 
errors in cluster sampling. 
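The effect of intraclass correlation on the standard error can be illustrated with the usual textbook approximation for equal-sized clusters, in which the variance of the mean is inflated by the factor 1 + (m - 1)ρ. This is offered as a sketch only; it is not necessarily the formula given by Marks, and the function name and all numerical values are hypothetical.

```python
import math


def se_mean_clustered(sigma, n_clusters, cluster_size, rho):
    """Approximate standard error of a mean from equal-sized clusters.

    sigma: population standard deviation
    rho:   intraclass correlation among cases in the same cluster
    """
    n_total = n_clusters * cluster_size
    inflation = 1 + (cluster_size - 1) * rho
    return sigma * math.sqrt(inflation / n_total)


sigma, rho = 16.0, 0.10  # hypothetical values

# The same 1000 cases arranged in three ways.
few_large = se_mean_clustered(sigma, n_clusters=10, cluster_size=100, rho=rho)
many_small = se_mean_clustered(sigma, n_clusters=100, cluster_size=10, rho=rho)
simple_random = se_mean_clustered(sigma, n_clusters=1000, cluster_size=1, rho=rho)
print(few_large, many_small, simple_random)
```

With the total number of cases held fixed, the arrangement with more communities and fewer cases in each yields the smaller standard error, which is the direction of Marks's suggestion.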

The importance of relating the design of experiments to the logic of 
induction has been especially emphasized by Fisher (17). Altho Fisher 
dealt mainly with agricultural experiments, the logic of his argument is 
universally applicable. In an experimental procedure it is often necessary 
to take into account the operation of variables other than the ones with 
which the experiment is concerned. Thus, in a comparison of teaching 
methods, it is often necessary to consider the variation of intelligence 
among pupils, and the differences of teaching ability among teachers. In 
the simplest design this variation can be provided for by random sampling 
from the total population. It is often possible, however, to assure a more 
satisfactory sample by sampling randomly at various levels of the variables 
which disturb the experiment. Thus, in comparing teaching methods, it 
may be desirable to sample separately from boys and girls. Fisher pointed 
out that such a modification of experimental design called for corre- 
sponding modification in the logic of statistical inference. Since the dif- 
ferences between boys and girls have been excluded from the process of 
randomization, these differences must also be excluded from the calculation 
of the standard errors used to judge the statistical significance of the 
comparisons made in the experiment. By the use of analysis of variance 
it is possible to subdivide the total variability in the data into components 
which are due to the operation of known variables entering into the ex- 
periment, and other components which are due to random variation and 
are used in tests of significance. Beginning with a simple psycho-physical 
experiment, Fisher described a number of experimental designs and their 
related tests of significance. Methods of calculation appropriate to these 
designs were given by Engelhart (16) and by Snedecor (40). 
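The point about excluding the boy-girl difference from the error term can be shown with a small numerical sketch. The scores below are hypothetical, and the within-cell sum of squares is used as a simple stand-in for the error term of the stratified design; it is not taken from Fisher's text.

```python
from statistics import mean

# Hypothetical gains: two teaching methods tried separately among boys and girls.
scores = {
    ("boys", "A"): [12, 14, 13, 15], ("boys", "B"): [16, 18, 17, 19],
    ("girls", "A"): [20, 22, 21, 23], ("girls", "B"): [24, 26, 25, 27],
}


def ss_about_mean(values):
    """Sum of squared deviations of the values from their own mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


# Error that still contains the boy-girl difference: pool within each method.
by_method = {}
for (sex, method), vals in scores.items():
    by_method.setdefault(method, []).extend(vals)
ss_within_method = sum(ss_about_mean(v) for v in by_method.values())
df_within_method = sum(len(v) - 1 for v in by_method.values())

# Error that excludes it: pool within each sex-by-method cell.
ss_within_cell = sum(ss_about_mean(v) for v in scores.values())
df_within_cell = sum(len(v) - 1 for v in scores.values())

ms_ignoring_sex = ss_within_method / df_within_method
ms_excluding_sex = ss_within_cell / df_within_cell
print(ms_ignoring_sex, ms_excluding_sex)
```

In these invented data the error mean square falls sharply once the sex difference is removed, so the same difference between methods is judged against a much smaller standard error.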


Comparison of Groups When Individuals Are Classified 
into Categories 


Frequently, research studies are concerned with individuals who may be 
classified as belonging to one or another of several discrete categories, 
such as sex or school grade. In such studies the observations are classified 
into the several categories and the frequencies in the categories are 
determined. The analysis then proceeds in one of two directions: (a) the 
observed frequencies are compared with expected frequencies based on
theoretical considerations, or (b) the observed frequencies in several
samples are compared with each other to determine whether they represent 
populations which have the same distribution. 
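Both directions of analysis lead to a chi-square statistic. The following Python sketch computes it for a theoretical distribution and for a comparison of several samples; the function names and all frequencies are hypothetical and are not taken from the studies reviewed here.

```python
def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over the categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_homogeneity(table):
    """Chi-square and degrees of freedom for several samples.

    table holds one row of category frequencies per sample; expected
    frequencies come from the row and column totals.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for row, row_total in zip(table, row_totals):
        expected = [row_total * ct / grand_total for ct in col_totals]
        statistic += chi_square(row, expected)
    df = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, df


# (a) Observed right and wrong answers against a guessing hypothesis:
#     120 four-choice responses, so 30 right and 90 wrong are expected.
fit_statistic = chi_square([45, 75], [30, 90])

# (b) Three samples of colleges classified as offering or not offering a course.
homogeneity_statistic, homogeneity_df = chi_square_homogeneity(
    [[30, 20], [25, 25], [15, 35]])
print(fit_statistic, homogeneity_statistic, homogeneity_df)
```

Each statistic is then referred to a chi-square table with the appropriate degrees of freedom; with 2 degrees of freedom the tabled value at the 5 percent level is 5.991.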

A study using a theoretical distribution is that of Malter (30), who 
rated children on a test of ability to read cross-section diagrams. A theo- 
retical distribution was constructed on the assumption that all responses 
were guesses, a like distribution was made up from the observations, 
and the two distributions were compared. 

Two groups were compared in a study by Dolger and Ginandes (14). 
A group of children in a private school and a group of the same number 
in a public school were asked to offer solutions to a disciplinary situation 
told to them as a story. Their solutions were then classified as constructive 
or nonconstructive. The authors found a larger percent of constructive 
solutions among the private school children than among those in the 
public school. Some comment on the sampling may be of interest in this 
connection. It must be assumed that the conclusions based on the data 
were meant to be generalized to a population extending beyond the children 
in these schools. Since the sampling is limited to only two schools, all 
the numerous characteristics which differentiate the children of one 
school from those of the other specialize the populations to which general- 
ization may be made to an unknown extent. 

A more complex classification was used by Doane (13) in a study 
of professional curriculum requirements for high-school teaching. Colleges 
offering teacher-training courses were classified into three categories 
composing the three populations in the study. Samples from the three popu- 
lations were then compared on the basis of proportions offering specific 
education courses. Chi-square tests of significance were calculated. 


Comparison of Two Groups on Measurable Traits 


Two kinds of comparison are made: (a) the responses of two groups to the
same set of stimuli (an intelligence test, for example) or (b) the responses 
of a given group of individuals to two sets of stimuli. In the first instance, 
random samples are drawn from the two populations and the groups are 
compared on some statistical measure, usually the mean. In the second 
instance, two different samples from the given population may be used, 
or the two sets of stimuli may be applied to the same sample. 
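The distinction between two independent samples and a single sample measured twice carries over to the test of significance. A minimal sketch using Student's t follows; the scores are hypothetical, and the pooled-variance form is assumed for the independent case.

```python
import math
from statistics import mean, variance


def t_independent(x, y):
    """Student's t for two independent samples, pooled variance."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    se = math.sqrt(pooled * (1 / nx + 1 / ny))
    return (mean(x) - mean(y)) / se, nx + ny - 2


def t_paired(x, y):
    """Student's t for the same individuals measured under two conditions."""
    diffs = [a - b for a, b in zip(x, y)]
    se = math.sqrt(variance(diffs) / len(diffs))
    return mean(diffs) / se, len(diffs) - 1


# Hypothetical scores of six children under two sets of stimuli.
first = [14, 18, 11, 20, 16, 13]
second = [12, 15, 10, 17, 15, 11]

t_ind, df_ind = t_independent(first, second)
t_pair, df_pair = t_paired(first, second)
print(t_ind, df_ind, t_pair, df_pair)
```

When the same individuals supply both sets of scores, the paired form removes the differences among individuals from the error. In these invented data it gives a much larger t than the independent form, which would be the wrong analysis for such a design.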

Several studies comparing two populations of individuals appeared in 
the literature. Heisler (19) compared children who read comic books 
with children who did not read such books, on intelligence, personality 
and socio-economic status. Hobson (21) compared boys and girls on 
primary mental traits. Leeds and Cook (28) used two groups of teachers, 
one superior and one inferior, to validate items on a scale for determining 
attitudes of teachers toward pupils. 

An example of a study, in which the responses to two sets of stimuli 
were compared by use of a single sample, is a study by Malter (31). The
children were asked to read a diagram of a production process, both
with and without the use of certain guide lines and arrows. The children 
were then tested on their comprehension of what appeared in the diagram 
under both sets of circumstances, and the mean scores of their responses 
were compared. 

Von Eschen (46) used two designs to study the effect of supervision on 
teaching efficiency. In the first design, two different groups of students 
were selected, one to be an experimental group, and one a control group. 
The teachers of the experimental group were subjected to a prescribed form 
of supervision, whereas those of the control group were not so supervised. 
Mean pupil gains in achievement of the two groups were compared. In the 
second design, the control group of the first experiment became an ex- 
perimental group the following year. Mean gains of the group for the two 
years were compared. 

In a more complex study Gold (18) compared responses of students to 
various methods of teaching dental health knowledge. Two similar groups 
of students were chosen for the principal part of the study, which was to 
compare the effect of intensive training in dental health with only inci- 
dental training in the subject. From each of the principal groups a sub- 
group was chosen for training in a manner differing slightly from the 
main group. In the analysis, groups were compared in pairs, the two main 
groups with each other, and each main group with the subgroup drawn 
from it, three comparisons in all. Since four methods of instruction were 
used, an analysis of variance comparing all four methods at once would 
seem appropriate. Such an analysis would yield a sensitive test for all 
comparisons, and would make possible an investigation of the relationship 
between length of instruction and improvement in dental care. 


Comparison of Several Groups on Measurable Traits 


The patterns of experimental design considered in the foregoing section 
on two group comparisons may be extended to the simultaneous comparison 
of several groups. As in the former situation, two kinds of comparison are made:
(a) the responses of several samples of individuals to the same set of 
stimuli, or (b) the responses of a given sample to several sets of stimuli. 
The statistical method applicable to these designs is the analysis of variance. 
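For case (a), the analysis of variance reduces to a ratio of the mean square among groups to the mean square within groups. A minimal sketch with hypothetical scores for four groups:

```python
from statistics import mean


def one_way_anova(groups):
    """Return the F ratio and its degrees of freedom for several groups."""
    all_scores = [s for g in groups for s in g]
    grand_mean = mean(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((s - mean(g)) ** 2 for g in groups for s in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical scores for four groups taught by four methods.
groups = [[5, 6, 7], [6, 7, 8], [8, 9, 10], [9, 10, 11]]
f_ratio, df_between, df_within = one_way_anova(groups)
print(f_ratio, df_between, df_within)
```

The single F ratio tests all group means at once, in place of a series of comparisons made pair by pair.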

A study by Blum (3) illustrates the comparison of responses of several 
populations to the same stimuli. Students preparing for five professions 
were compared on a number of personality traits. Analysis of variance was 
used to test the variability among the means of groups. As is customary 
in such studies, comparisons were made on each trait separately, altho 
methods for simultaneous comparison on several traits are available. 

A study by Baldwin (1) illustrates a comparison of responses of the 
same individuals to differing stimuli. The Fels Parent Behavior Scales 
were given to a group of mothers during the periods of pre-pregnancy,
pregnancy, and post-pregnancy. The hypothesis that the means of the
scores for the three periods were equal was tested by analysis of variance.
A test of the linearity of the means was also made. 


Experiments in Randomized Blocks 


In order that the conclusions of an experiment comparing several 
procedures may be generalized to a fairly extensive population, it is 
frequently necessary to sample individuals in several environments; for 
example, schools, communities, and grades. All procedures must then be 
tried in each environment, for, if some procedures are tried in one 
environment and others in another, it will not be possible to determine 
whether resulting mean differences are attributable to procedural or to 
environmental effects. Furthermore, it is necessary to take into account 
differences among individuals within a given environment. The last is 
achieved by assigning individuals within an environment to the various 
procedures in a random manner. The resulting design may be called a 
randomized-block design, because of its parallel to agricultural experi- 
mentation, in which randomization is performed within blocks of land. 

This design is well described by Burt and Lewis (7). Twenty backward 
readers were selected randomly from each of five schools. The children
from each school were divided randomly into four equal groups, each group 
to be taught reading by one of four methods of instruction. At the end 
of the period of instruction the gain of each pupil was determined. Using 
the gains as scores, an analysis of variance was made, yielding mean 
squares due to methods of instruction, schools, method-school interaction, 
and variation within the twenty subgroups. The authors used the variance 
within subgroups as the error variance to test the significance of the 
differences among method means. This calculation is questionable. It is 
valid if the conclusions are to apply only to the five schools used in the 
experiment. If, however, the conclusions are to be generalized to some 
population of schools going beyond these five schools, the variance due to
interaction should have been the error variance to test the differences 
among methods. 
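The two choices of error variance can be compared in a sketch that follows the layout of this design: five schools, four methods, and five pupils in each subgroup. The gains are simulated and hypothetical, not the data of Burt and Lewis, and the variable names are illustrative.

```python
import random
from statistics import mean

rng = random.Random(0)
schools, methods, per_cell = 5, 4, 5

# Hypothetical gains: gains[s][m] lists the pupils of school s taught by method m.
gains = [[[rng.gauss(10 + m, 3) for _ in range(per_cell)]
          for m in range(methods)] for s in range(schools)]

grand = mean(x for school in gains for cell in school for x in cell)
school_means = [mean(x for cell in school for x in cell) for school in gains]
method_means = [mean(x for school in gains for x in school[m])
                for m in range(methods)]
cell_means = [[mean(cell) for cell in school] for school in gains]

ss_methods = schools * per_cell * sum((v - grand) ** 2 for v in method_means)
ss_schools = methods * per_cell * sum((v - grand) ** 2 for v in school_means)
ss_interaction = per_cell * sum(
    (cell_means[s][m] - school_means[s] - method_means[m] + grand) ** 2
    for s in range(schools) for m in range(methods))
ss_within = sum((x - cell_means[s][m]) ** 2
                for s in range(schools) for m in range(methods)
                for x in gains[s][m])

df_methods, df_schools = methods - 1, schools - 1
df_interaction = df_methods * df_schools
df_within = schools * methods * (per_cell - 1)

ms_methods = ss_methods / df_methods
# Error for conclusions limited to these five schools.
f_fixed_schools = ms_methods / (ss_within / df_within)
# Error for conclusions generalized to a population of schools.
f_sampled_schools = ms_methods / (ss_interaction / df_interaction)
print(f_fixed_schools, f_sampled_schools)
```

The second ratio rests on 12 degrees of freedom in place of 80, so the test for a population of schools is ordinarily the less sensitive of the two.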

Bollinger’s (4) study of the impact of teachers on pupils provides 
an illustration of an investigation carried on in several environments 
without elimination of bias due to environmental differences. Certain 
social attitudes and aspects of social adjustment of teachers and students 
in the high schools of three small communities were investigated. Bollinger 
found that in one of these communities both teachers and students had 
lower scores than the corresponding groups in the remaining two communi- 
ties. The differences among the scores of pupils in the three communities 
were then ascribed to the impact of the teachers. Because of the design 
of the experiment, this conclusion has doubtful validity. Both teacher 
and pupil differences may be due to community differences. The last 
interpretation is supported by the results of a portion of the study in which 
the gain of pupils in social adjustment was studied in relation to the
adjustment of teachers within a given community. Here a negative
relationship was found, so that the pupils having better adjusted teachers gained
somewhat less in social adjustment than pupils having teachers rated 
lower in social adjustment. 


The Latin Square 


The randomized-block design makes possible the consideration of a 
source of variation in addition to the variation among the procedures being 
studied. It is frequently desirable to consider two or more sources of vari- 
ation in addition to the basic variation among procedures. A device, which 
provides a method for taking into account two such sources of variation, is 
the Latin square. This design is described by Fisher (17) and by Snedecor 
(40), and has been referred to in previous reviews published in this series. 

Without using the terminology commonly associated with the Latin 
square, Wilson (50) used this design to study the effect of length of reading 
material on comprehension. Essays of 300, 600, and 1200 words were pre- 
pared on each of the three topics, “Paper Today,” “Paper Industry,” 
and “History of Paper.” Each of three groups of children was then asked 
to read essays of all three lengths, but each length associated with a
different topic. After each passage was read, a comprehension score was
obtained for each student. The Latin square appears as follows:

                 Paper          Paper          History
                 Today          Industry       of Paper
Group I          300 words      600 words      1200 words
Group II         600 words      1200 words     300 words
Group III        1200 words     300 words      600 words

The value of this design is that each of the three lengths was tested 
on all students, so that the variability among groups was eliminated. 
In addition, each length was tested on three topics, so that the bias of a 
particular topic was eliminated. The author failed, however, to utilize 
these values in the analysis of the data, since she made all comparisons 
by using differences of means in the nine cells. The results did not, 
therefore, appear as conclusive as they might if the appropriate analysis 
of variance had been used. Thus, in one of the topics, “Paper Today,” 
the 1200-word essay resulted in little better comprehension than the 
600-word essay, but further analysis shows that the longer essay was 
read by the group which was decidedly the poorest of the three. When 
analysis of variance is used, the variation among the means of the three
essay lengths is very significant. A supplementary analysis indicated
a linear relationship between essay length and comprehension. 
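The analysis of variance appropriate to this square can be sketched on the nine cell means. The layout below follows the table in the text, but the comprehension scores are hypothetical and are not Wilson's data. A fuller analysis would work from the scores of individual pupils rather than from cell means.

```python
from statistics import mean

# Layout from the text: rows are Groups I-III; columns are the topics
# "Paper Today", "Paper Industry", "History of Paper"; entries are essay lengths.
lengths = [[300, 600, 1200],
           [600, 1200, 300],
           [1200, 300, 600]]

# Hypothetical mean comprehension scores for the nine cells.
scores = [[10.0, 14.0, 17.0],
          [13.0, 19.0, 11.0],
          [12.0, 8.0, 10.0]]

k = 3
cells = [(i, j) for i in range(k) for j in range(k)]
grand = mean(scores[i][j] for i, j in cells)

group_means = [mean(row) for row in scores]
topic_means = [mean(col) for col in zip(*scores)]
length_means = {
    n: mean(scores[i][j] for i, j in cells if lengths[i][j] == n)
    for n in (300, 600, 1200)
}


def ss(means):
    """Sum of squares for a set of k means, each based on k cells."""
    return k * sum((m - grand) ** 2 for m in means)


ss_total = sum((scores[i][j] - grand) ** 2 for i, j in cells)
ss_groups = ss(group_means)
ss_topics = ss(topic_means)
ss_lengths = ss(length_means.values())
ss_residual = ss_total - ss_groups - ss_topics - ss_lengths

df_effect = k - 1
df_residual = (k - 1) * (k - 2)
f_lengths = (ss_lengths / df_effect) / (ss_residual / df_residual)
print(length_means, round(f_lengths, 1))
```

In these invented scores the 1200-word essay on the first topic does no better than the 600-word essay, because it was read by the weakest group. Once group and topic differences are removed, the variation among the three length means stands out clearly.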


Factorial Designs 





An important aspect of educational research is the determination of fac- 
tors which are related to individual differences. A factor is here understood 
to mean a method of differentiating individuals by classifying them in 
groups. Examples of factors are sex, national origin, and grade placement. 
Individuals in a sample may be classified in accordance with several factors 
simultaneously. Factorial studies usually proceed by obtaining a common 
measure for all individuals in a sample, classifying individuals on the 
basis of each of several factors, and comparing the means of groups within 
each factor separately. These studies neglect the overlap of factors, since 
several factors operate simultaneously to distinguish some individuals 
from others. This “interaction” of factors may indeed be the most interest- 
ing aspect of the study. The problem is discussed in (17: Chapter VI) 
and in (24). 

Several examples of factorial studies follow: Smith (39) obtained 
scores on the Bell Adjustment Inventory for a group of students, and then 
compared means of groups in accordance with several factors, such as par- 
ticipation or nonparticipation in athletics, and membership or non- 
membership in fraternities. Stright (42) considered similar factors in re- 
lation to scholarship. Both authors found significant relationships between 
the factors in question and the criteria. Since the authors neglected to study 
interactions, certain questions remain unanswered. 

Cheydleur (10) examined the relation of several factors to success 
in college teaching. The factors were examined separately and no account 
was taken of interactions. The need for examining interactions is illustrated 
clearly by the two factors: (a) having professorial rank, and (b) not doing 
graduate work. Both of these were related to good teaching, and surely 
interact. 

Brownell (5), who investigated two factors related to teaching sub- 
traction, did consider interactions. One factor consists of two methods 
of teaching borrowing, and the other of teaching either method mechani- 
cally or by rationalizing the procedure. The analysis of data dealt with 
the subjects as four groups, distinguished by both the method of subtrac- 
tion and the method of instruction. Thus, the analysis may be said to have 
dealt only with interactions, and to have neglected main effects. Compari- 
sons were made by using differences of means between all possible pairs 
of groups. It is worth while noting that the methods of analysis of variance 
would have provided information regarding both main effects and inter- 
actions with a more sensitive test of each. The value of investigating inter- 
actions is further emphasized by Brownell (6: p. 111) in his critical 
comments on a later study of the same problem. 
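
A minimal sketch of how analysis of variance separates main effects from interaction in a two-by-two layout of this kind is given below. It assumes equal numbers in the four groups. The factor labels and scores are hypothetical and are not Brownell's data.

```python
from statistics import mean

# Hypothetical scores for a 2 x 2 layout with four pupils per cell.
# Factor labels and values are illustrative, NOT Brownell's data.
CELLS = {
    ("method_1", "mechanical"): [12, 14, 13, 15],
    ("method_1", "rational"): [18, 17, 19, 20],
    ("method_2", "mechanical"): [13, 12, 14, 13],
    ("method_2", "rational"): [14, 15, 13, 16],
}


def two_way_anova(cells):
    """Balanced two-factor analysis of variance with interaction."""
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))
    if any(len(v) != n for v in cells.values()):
        raise ValueError("this sketch assumes equal cell sizes")

    cell_mean = {k: mean(v) for k, v in cells.items()}
    grand = mean(cell_mean.values())
    a_mean = {a: mean(cell_mean[(a, b)] for b in b_levels) for a in a_levels}
    b_mean = {b: mean(cell_mean[(a, b)] for a in a_levels) for b in b_levels}

    ss_a = n * len(b_levels) * sum((a_mean[a] - grand) ** 2 for a in a_levels)
    ss_b = n * len(a_levels) * sum((b_mean[b] - grand) ** 2 for b in b_levels)
    ss_ab = n * sum(
        (cell_mean[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
        for a in a_levels
        for b in b_levels
    )
    ss_within = sum((x - cell_mean[k]) ** 2 for k, v in cells.items() for x in v)

    df_a, df_b = len(a_levels) - 1, len(b_levels) - 1
    ms_within = ss_within / (len(cells) * (n - 1))  # one error term for all tests
    return {
        "F_A": (ss_a / df_a) / ms_within,
        "F_B": (ss_b / df_b) / ms_within,
        "F_AB": (ss_ab / (df_a * df_b)) / ms_within,
        "ss": {"A": ss_a, "B": ss_b, "AB": ss_ab, "within": ss_within},
    }


anova = two_way_anova(CELLS)
print(anova)
```

All three ratios use the same within-group error term, which pools the four groups. This pooling is the source of the more sensitive test mentioned above, as against comparing pairs of group means one at a time.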

Schroeder (37) investigated the teaching of archery using three groups 
of students all of whom received six lessons in the skill. Three ranges 
of thirty, forty, and fifty yards were used, all students shooting at all 
three ranges at each lesson in varying order for different groups and 
different days. The experimental design permitted an analysis of the follow- 
ing four factors and their interactions: (a) groups, (b) test halves (the 
first three lessons against the last three), (c) order of shooting at a given 
range within a given lesson, and (d) length of range. Information was ob- 
tained on main effects and on first-, second-, and third-order interactions 
by use of analysis of variance. 


The “Split-Plot” Design 


This design owes its title to a type of agricultural experiment in which 
several procedures are compared in a set of plots. In addition to comparing 
one plot with another, the plots are divided into two or more portions 
which are treated differently, so that intraplot comparisons can be made. 
There are then, in fact, two parallel experiments with different estimates 
of error (40: p. 309). 

The split-plot design was used by Vergara (45) in a study comparing 
the comprehension of poetry in oral presentation with its comprehension 
when read silently. Sixteen poems of four types and two groups of subjects 
were selected. Each group read eight of the poems silently and listened 
to an oral presentation of the remaining eight. Thus, each poem had both 
an oral and silent presentation and, correspondingly, two scores. Two 
types of comparisons were then possible. Using the sum of the oral and 
silent scores for each poem it was possible to compare the mean scores 
of the four types of poem. Using the differences between oral and silent 
scores it was possible to compare oral with the silent presentation. The two 
comparisons used different estimates of error. 
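
The two comparisons and their separate error estimates can be sketched as follows. The sketch assumes four poems of each type and a single oral and a single silent score per poem. The type labels and scores are hypothetical and are not Vergara's data.

```python
from statistics import mean

# (poem type, oral score, silent score) for sixteen poems, four per type.
# Type labels and scores are hypothetical, NOT Vergara's data.
POEMS = [
    ("type_a", 12, 10), ("type_a", 14, 11), ("type_a", 13, 12), ("type_a", 15, 12),
    ("type_b", 18, 17), ("type_b", 17, 15), ("type_b", 19, 18), ("type_b", 16, 16),
    ("type_c", 11, 9), ("type_c", 10, 10), ("type_c", 12, 9), ("type_c", 13, 11),
    ("type_d", 8, 9), ("type_d", 9, 7), ("type_d", 10, 9), ("type_d", 7, 8),
]


def group_by_type(poems, combine):
    """Collect one derived value per poem (sum or difference) under its type."""
    groups = {}
    for poem_type, oral, silent in poems:
        groups.setdefault(poem_type, []).append(combine(oral, silent))
    return groups


def mean_squares(groups):
    """Between-type and within-type mean squares for a one-way layout."""
    values = [x for g in groups.values() for x in g]
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)
    return ss_between / (len(groups) - 1), ss_within / (len(values) - len(groups))


sums = group_by_type(POEMS, lambda oral, silent: oral + silent)
diffs = group_by_type(POEMS, lambda oral, silent: oral - silent)

# Comparison 1: poem types, judged against variation among poems of one type.
ms_types, error_sums = mean_squares(sums)
f_types = ms_types / error_sums

# Comparison 2: oral versus silent, judged against the variation of the
# oral-minus-silent differences among poems of one type.
ms_type_by_mode, error_diffs = mean_squares(diffs)
all_diffs = [d for g in diffs.values() for d in g]
f_mode = len(all_diffs) * mean(all_diffs) ** 2 / error_diffs
f_type_by_mode = ms_type_by_mode / error_diffs

print("error for types:", error_sums, " F for types:", f_types)
print("error for mode: ", error_diffs, " F for mode: ", f_mode)
```

The error term for the sums and that for the differences are computed from different quantities and generally differ in size, which is the sense in which the design contains two parallel experiments.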


Use of Concomitant Measures in Group Comparisons 


In the previous sections of this chapter experimental designs were 
considered in which all subjects are scored on a measurable trait and are 
then grouped into several categorical classes in accordance with one or 
more criteria of classification. These criteria of classification were con- 
sidered as sources of variation in the data. It is often necessary to consider 
sources of variation which are based on measurable traits rather than 
categorical classes. Thus, in comparing educational achievement of several 
groups, it is desirable to consider the intelligence scores of individuals 
in the groups. The appropriate method of analysis in this situation is 
the analysis of covariance. 
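
A minimal sketch of the analysis of covariance with one concomitant measure and two groups follows. The group labels and scores are hypothetical. The sketch assumes that a single regression slope holds within both groups, an assumption that would need checking in practice.

```python
from statistics import mean

# Hypothetical (concomitant measure, outcome) pairs for each subject,
# for example (intelligence score, achievement score).
GROUPS = {
    "group_a": [(90, 40), (95, 44), (100, 47), (105, 52), (110, 55)],
    "group_b": [(100, 45), (105, 48), (110, 53), (115, 56), (120, 60)],
}


def sums_of_products(pairs):
    """Return (Sxx, Sxy, Syy) taken about the means of the given pairs."""
    mx = mean(x for x, _ in pairs)
    my = mean(y for _, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxx, sxy, syy


def ancova(groups):
    everyone = [p for pairs in groups.values() for p in pairs]
    grand_x = mean(x for x, _ in everyone)

    # Pooled within-group sums give the common regression slope.
    within = [sums_of_products(pairs) for pairs in groups.values()]
    sxx_w = sum(w[0] for w in within)
    sxy_w = sum(w[1] for w in within)
    syy_w = sum(w[2] for w in within)
    slope = sxy_w / sxx_w

    # Each group mean is moved to where it would stand at the grand mean of x.
    adjusted = {
        name: mean(y for _, y in pairs)
        - slope * (mean(x for x, _ in pairs) - grand_x)
        for name, pairs in groups.items()
    }

    # Compare residual variation with and without separate group means.
    sxx_t, sxy_t, syy_t = sums_of_products(everyone)
    resid_total = syy_t - sxy_t ** 2 / sxx_t
    resid_within = syy_w - sxy_w ** 2 / sxx_w
    df_between = len(groups) - 1
    df_within = len(everyone) - len(groups) - 1
    f_ratio = ((resid_total - resid_within) / df_between) / (resid_within / df_within)
    return slope, adjusted, f_ratio


slope, adjusted, f_ratio = ancova(GROUPS)
print("common slope:", slope)
print("adjusted means:", adjusted)
print("F for adjusted group difference:", f_ratio)
```

In these invented figures the second group has the higher unadjusted mean, but the order is reversed once the difference in the concomitant measure is allowed for.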

Johnson and Tsao (24) used two concomitant measures in a factorial 
study of educational achievement of high-school students. The purpose 
of the study was to investigate the effect of four factors on scores in a test 
of scholastic achievement given at the end of the school year. The factors 
were grade status, scholastic status, sex, and rank order within each grade- 
scholastic-sex subgroup. In addition to the scores on the achievement test 
given at the end of the school year, scores on the same test given at the 
beginning of the year and mental age scores were obtained for all subjects. 
The latter two measures were used to eliminate the component of variabil- 
ity in the final scores due to these measures from the group comparison 
required in the study of the various factors. All main effects and interactions 
were examined. 

Moore (34) used chronological age and mental age as concomitant 
measures in a comparison of vocabularies of orphanage and nonorphanage 
children. Since the two groups differed considerably in both chronological 
and mental age, it was necessary to eliminate these sources of variation 
before a fair comparison of vocabularies was possible. 

Johnson and Hoyt (23) compared two groups on their ability to learn 
physics. Three additional measures, consisting of scores on the American 
Council on Education Psychological Examination, elementary mathematics, 
and honor-point ratio, were obtained for each individual. The two groups 
were then matched on the latter three variables in such a way that com- 
parison between the groups could be made on each set of three scores 
based on these variables. The matching procedure can be viewed geometri- 
cally if coordinate axes are set up in three dimensional space and scales 
corresponding to the matching variables are set up on these axes. A set 
of three scores is then a point in this space, and the groups can be matched 
at each point. By examining sets of points, regions of significance and non- 
significance can be laid out. The matching procedure was achieved by 
calculating regression equations after the data had been obtained, rather 
than by matching at the beginning of the experiment. 


Relations between Two Measurable Traits 


The study of relations between two traits is so standard a procedure 
of research that it seems scarcely necessary to report on it in this review. 
Some of the applications may, however, be of interest. The applications 
reviewed here are of two kinds: (a) those dealing with prediction, and 
(b) those dealing with test validation. 

The following studies dealt with prediction. By correlating judges’ esti- 
mates of item difficulties with their true difficulties, Tinkelman (44) found 
that judges were quite successful in predicting item difficulty. Stalnaker 
(41) studied the relationship of achievement in college with scores on an 
entrance examination. Thorndike (43) correlated scores on the College 
Board Aptitude Test with grades on intelligence tests taken at various times 
prior to college entrance to determine the effect of lapse of time on the 
estimation of intelligence. Lantz (27) investigated the value of informing 
teachers of achievement to be expected of children when this expectation 
is based on tests administered to the pupils. Jayne (22) carried out an 
experiment in which ten teachers taught the same lesson and studied 
pupil gain in relation to the procedures of the teachers. 

The following studies dealt with test validation. In order to validate 
a paper-and-pencil test on superstitions, Zapf (51) created a situation 
in which responses of students to stimuli associated with various super- 
stitions could be observed and scored. The students were instructed to 
enter a room in which the stimuli were present and their behavior was 
observed without their knowledge. Dyer (15) used actual translations 
made by students as a criterion to validate objective tests of ability to trans- 
late German into English. 


Use of Multiple Measures in Prediction 


In an effort to find some means of predicting behavior, many measures 
are often made and multiple regression estimates are calculated to synthe- 
size the diverse measures. 

Lins (29) studied the possibility of predicting teaching efficiency from 
information available during and prior to attendance at college. Numerous 
measures were obtained for each of fifty-eight teachers from the applica- 
tion blanks filled out at entrance to college, from intelligence tests, from 
an autobiography, and from an analysis of teaching fitness as shown 
during the senior year. Three criteria of teaching efficiency were used: 
(a) a combined rating based on opinions of supervisors and of observers 
from the school of education; (b) a rating based on comments of students; 
and (c) a measure of gain of pupils in educational achievement. Regression 
equations were calculated between the measures of prediction and each 
of the criteria. The possibility of correlating all criteria simultaneously 
with the predictors by use of canonical correlations suggests itself here. 

In a companion study using the same teachers and the same criteria as 
Lins, Von Haden (47) studied the prediction of teaching efficiency by use 
of subjective measures. These measures were obtained from comments 
of interviewers, instructors and supervisors; from an analysis of steno- 
graphic reports of interviews; from autobiographies; and from personal 
data. Regression equations were calculated. In another companion study, 
Jones (25) used objective measures based on data available at entrance 
to college and on college grades to predict teaching efficiency. Hellfritzsch 
(20) made a factor analysis of the measures used by Lins, Von Haden, 
and Jones. Carter and Dudek (9) used several measures in a study of plane 
navigators’ efficiency. Regression equations were calculated using as a 
criterion errors in bringing the plane to a required point. 

Detchen (12) investigated the value of items in the Kuder Preference 
Record in predicting success on a social-science comprehensive examina- 
tion given at the University of Chicago. The items in the test were keyed 
by comparing responses of students who did well to responses of those 
who did poorly on the social-science examination. The resulting test was 
then included in a test battery to predict success in the examination. 
An important aspect of Detchen’s work is the fact that she tested her 
key on a sample other than the one used to establish the key. This pro- 
cedure was necessary, since the key contained many sampling errors and 
could not have been regarded as valid for the population unless tested 
further. A need for similar testing arises in the studies of teaching 
efficiency referred to above. The authors selected those groups of measures 
which had highest multiple correlation with the criteria, but these multiple 
correlations were highest only in the sample, and additional study is needed 
to determine whether they are highest in the population. 
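
The point can be illustrated with a small simulation, offered only as a sketch. It assumes a population in which no item has any relation to the criterion, keys the items on one sample, and then applies the key to a fresh sample. The keying rule, sample sizes, and seed are all hypothetical and are not those of Detchen or of the teaching-efficiency studies.

```python
import random


def pearson(xs, ys):
    """Product-moment correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def build_key(responses, criterion):
    """Key each item +1 or -1 by the sign of its covariance with the criterion."""
    n = len(criterion)
    mean_c = sum(criterion) / n
    key = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        mean_item = sum(item) / n
        cov = sum((i - mean_item) * (c - mean_c) for i, c in zip(item, criterion))
        key.append(1 if cov >= 0 else -1)
    return key


def keyed_scores(responses, key):
    return [sum(k * r for k, r in zip(key, row)) for row in responses]


def random_sample(rng, n_people, n_items):
    """Items are pure noise: in the population no item predicts the criterion."""
    responses = [[rng.randint(0, 1) for _ in range(n_items)] for _ in range(n_people)]
    criterion = [rng.gauss(50, 10) for _ in range(n_people)]
    return responses, criterion


rng = random.Random(1948)  # arbitrary seed, chosen only for repeatability
keying_responses, keying_criterion = random_sample(rng, 80, 60)
fresh_responses, fresh_criterion = random_sample(rng, 80, 60)

key = build_key(keying_responses, keying_criterion)
r_keying = pearson(keyed_scores(keying_responses, key), keying_criterion)
r_fresh = pearson(keyed_scores(fresh_responses, key), fresh_criterion)
print("correlation in keying sample:", round(r_keying, 2))
print("correlation in fresh sample: ", round(r_fresh, 2))
```

By construction the correlation in the keying sample cannot be negative, even though the items are worthless, whereas the correlation in the fresh sample should be expected to fall near zero. Only the second figure says anything about the population.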


Comparison of Groups by Use of Multiple Measures 


It is common in studies which compare two or more groups to obtain 
several measures on each individual, but to compare the groups on each 
measure separately. However, procedures are available for comparing the 
groups on several measures at once. 

Lins (29) studied the relationship of a set of measures, which included 
college-entrance tests and high-school records, to admission to a school 
of education at the end of the second year at college. There were two 
groups to be distinguished, those admitted to the school of education 
and those not admitted. By use of biserial correlations between the two 
groups and the various predictors, Lins obtained a regression equation 
to predict admission to the school. 

Baten (2) used scores based on judges’ opinions of color, mealiness, 
texture, and flavor to distinguish between types of potatoes. Using discrimi- 
nant function analysis, he obtained a set of weights by which the scores 
can be combined to obtain best discrimination. 
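
A minimal sketch of the computation for two groups and two scores follows; Baten used four scores, and the reduction to two is made here only to keep the arithmetic visible. The scores are hypothetical. The weights are obtained, as in Fisher's linear discriminant function, by solving the pooled within-group sums of squares and products against the difference of the group means.

```python
# Hypothetical judges' scores (color, flavor) for two types of potato.
# These are NOT Baten's data, and only two of his four scores are imitated.
TYPE_1 = [(6, 7), (7, 8), (5, 6), (6, 8), (7, 7)]
TYPE_2 = [(4, 5), (5, 4), (3, 5), (4, 4), (5, 5)]


def mean_vector(rows):
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]


def scatter(rows, center):
    """Sums of squares and products about the group mean (2 x 2, symmetric)."""
    s00 = sum((r[0] - center[0]) ** 2 for r in rows)
    s01 = sum((r[0] - center[0]) * (r[1] - center[1]) for r in rows)
    s11 = sum((r[1] - center[1]) ** 2 for r in rows)
    return s00, s01, s11


def discriminant_weights(group_1, group_2):
    m1, m2 = mean_vector(group_1), mean_vector(group_2)
    a = scatter(group_1, m1)
    b = scatter(group_2, m2)
    # Pooled within-group scatter
    s00, s01, s11 = a[0] + b[0], a[1] + b[1], a[2] + b[2]
    d0, d1 = m1[0] - m2[0], m1[1] - m2[1]
    det = s00 * s11 - s01 ** 2
    # Solve (pooled scatter) * w = (difference of means) for the 2 x 2 case.
    return [(s11 * d0 - s01 * d1) / det, (s00 * d1 - s01 * d0) / det]


weights = discriminant_weights(TYPE_1, TYPE_2)
scores_1 = [weights[0] * c + weights[1] * f for c, f in TYPE_1]
scores_2 = [weights[0] * c + weights[1] * f for c, f in TYPE_2]
print("weights:", weights)
print("type 1 scores:", scores_1)
print("type 2 scores:", scores_2)
```

The weights are determined only up to a constant multiplier, so their ratio is what matters. In these invented figures every weighted score of the first type exceeds every weighted score of the second.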

The problem of discriminating between more than two groups was con- 
sidered by Rostker (36) in a study of factors relating to teaching ability. 
Teachers of twenty-four classes were measured on many traits. Pupil gain 
as measured by initial and final tests was the criterion in the study. In order 
to determine whether the differences in pupil gain were related to the 
teacher measures, average pupil gain was calculated for each class, and the 
averages were correlated with the set of teacher measures. To eliminate 
variation due to differences in ability among pupils in the various classes, 
a regression equation was calculated for pupil gain and measures of 
ability. Pupil gain was then measured in terms of deviations from the 
regression estimates. Companion studies similar to that of Rostker but 
using different measures were carried out by Rolfe (35) and La Duke (26). 
Rostker considered each class as an individual and so had only twenty-four 
cases to work with, a small number for a multivariate study. Actually 
there were as many cases as pupils in the classes and, consequently, many 
more degrees of freedom for testing significance. 


Sample Design in Surveys 


Much greater care is often exercised in obtaining a sample when a survey 
is carried out than in the usual research study. An outstanding example of 
a carefully planned survey is that made by the U. S. Office of Education 
in 1946 to determine higher educational enrolments, and described by 
Cornell (11). The survey was based on a sample of approximately one-fifth 
of all institutions providing higher education in the United States. The 
institutions in the sample were obtained by setting up strata on the basis 
of category of school and size within category, and then sampling randomly 
within the strata. By use of mathematical formulas and information 
regarding enrolments obtained from previous surveys, the number of 
institutions to be sampled from each stratum was determined, so as to 
provide standard errors of preassigned accuracy. 
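
The review does not say which formulas were used. One common choice, shown here only as a sketch, is the allocation usually credited to Neyman, in which the number sampled from a stratum is proportional to the size of the stratum times its standard deviation. The strata, their sizes, the standard deviations, and the target standard error below are all hypothetical.

```python
import math

# (stratum label, institutions in stratum, assumed SD of enrolment).
# Every figure is hypothetical; the SDs stand in for estimates from earlier surveys.
STRATA = [
    ("large universities", 150, 3000.0),
    ("small colleges", 800, 400.0),
    ("junior colleges", 600, 200.0),
]
TARGET_SE = 30.0  # preassigned standard error of the mean enrolment per institution


def allocate(strata, target_se):
    """Sample size per stratum under Neyman allocation for a target standard error."""
    population = sum(size for _, size, _ in strata)
    sum_ws = sum(size * sd for _, size, sd in strata) / population
    sum_ws2 = sum(size * sd ** 2 for _, size, sd in strata) / population
    # Total sample size, allowing for the finite population.
    n = sum_ws ** 2 / (target_se ** 2 + sum_ws2 / population)
    return {
        label: min(size, math.ceil(n * (size * sd / population) / sum_ws))
        for label, size, sd in strata
    }


def standard_error(strata, allocation):
    """Standard error of the stratified mean for a given allocation."""
    population = sum(size for _, size, _ in strata)
    variance = 0.0
    for label, size, sd in strata:
        m = allocation[label]
        weight = size / population
        variance += weight ** 2 * sd ** 2 / m * (1 - m / size)
    return math.sqrt(variance)


allocation = allocate(STRATA, TARGET_SE)
achieved_se = standard_error(STRATA, allocation)
print(allocation)
print("achieved standard error:", achieved_se)
```

Because each stratum's number is rounded upward, the achieved standard error falls at or slightly below the preassigned figure. The most variable stratum is sampled most heavily even though it contains the fewest institutions.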

The application of sampling methods to public-opinion polls was de- 
scribed by Cantril (8). Smith (38) combined survey methods with those 
of testing hypotheses. The purpose of the study was to determine the rela- 
tionship between level of knowledge and liberal opinion. A sample of 6000 
persons was obtained. For each person, level of knowledge was measured by 
response to four questions based on current events, and liberal opinion by 
response to four opinion questions. The relationship between the two 
measures was studied within groups which were homogeneous both as to 
occupation and income. 


Sequential Sampling 


Sequential sampling provides a modification of the usual sampling 
procedure, since by this method cases are obtained sequentially, one by one, 
or in groups. When applied to testing a hypothesis, the hypothesis is 
formulated and two probabilities are assigned in advance of the sampling 
procedure. The probabilities are those of two errors: (a) rejecting the hy- 
pothesis when it is true, and (b) accepting the hypothesis when it is false. 
These probabilities are arbitrary, and are based on the practical importance 
of these errors. On the basis of the hypothesis and the pre-assigned prob- 
abilities, certain calculations are made at each stage of the sampling 
sequence, i.e., after each observation or group of observations. The calcu- 
lations lead to one of three decisions: (a) reject the hypothesis, (b) accept 
the hypothesis, (c) continue sampling. It has been found that sequential 
sampling reduces greatly the number of cases required for a decision as 
compared to ordinary sampling. The theory and numerous applications 
were described by Wald (48). 
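
The three-way decision can be sketched for the simplest case, in which each observation is a success or a failure and the hypothesis concerns the proportion of successes. The sketch follows the sequential probability ratio test with Wald's approximate boundaries. The function name, the two proportions, and the error probabilities are illustrative choices and are not taken from the source.

```python
import math


def sequential_test(observations, p0, p1, alpha, beta):
    """Sequential probability ratio test of p = p0 against p = p1 (p1 > p0).

    alpha: chance of rejecting the hypothesis p = p0 when it is true.
    beta:  chance of accepting the hypothesis p = p0 when p = p1 is true.
    Returns the decision and the number of observations used.
    """
    upper = math.log((1 - beta) / alpha)   # crossing it: reject the hypothesis
    lower = math.log(beta / (1 - alpha))   # crossing it: accept the hypothesis
    log_ratio = 0.0
    for count, success in enumerate(observations, start=1):
        if success:
            log_ratio += math.log(p1 / p0)
        else:
            log_ratio += math.log((1 - p1) / (1 - p0))
        if log_ratio >= upper:
            return "reject", count
        if log_ratio <= lower:
            return "accept", count
    return "continue sampling", len(observations)


# Hypothetical use: is the proportion passing an item 0.5, or as high as 0.8?
print(sequential_test([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 0.5, 0.8, 0.05, 0.05))
print(sequential_test([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 0.5, 0.8, 0.05, 0.05))
print(sequential_test([1, 0], 0.5, 0.8, 0.05, 0.05))
```

With these settings an unbroken run of successes leads to rejection at the seventh observation and an unbroken run of failures to acceptance at the fourth. A mixed record simply calls for more cases.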


CHAPTER V 


Observational Methods of Research 


SAUL B. SELLS 


The three-year period covered by this chapter was deeply affected by 
the war, one important effect of which is apparent in the displacement of 
research personnel during that period. Nevertheless a considerable quantity 
of research in all fields was published. The studies summarized here appear 
to make contributions to or to illustrate particular aspects of observational 
research methodology. This chapter, altho organized somewhat differently, 
continues the previous review of observational methods by Sells and 
Travers (115). 


Direct Observation Technics 


A wide variety of forms of direct observation for collection of basic 
data has been noted. Some investigators use written documents or recorded 
factual material as a source with appropriate technics of recording and 
analyzing. Others observe human behavior in action, 
in actual real-life, clinical, and experimental situations. Systematic studies 
employing direct observation of behavior have been confined mostly to 
infants and young children. Much useful information might be obtained 
by extension of these technics to other groups. 

It should be noted, however, that the value of any technic of direct 
observation as a definitive research method depends on the rigor with 
which it is applied. Direct observation and especially records from direct 
observation are subjective. Reliability of data, or reproducibility, can be 
increased by standardization of procedure, by the proper use of codes or 
recording schedules having formally defined categories of classification, 
by the use of apparatus aids, such as moving tapes or stop watches for 
time samples, and by the use of additional observers to check on the 
original observer. The value of results based on these methods must be 
judged in part by the evidence of reliability of observations. The represent- 
ativeness of the sample of population observed must similarly be evaluated. 
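
One way of putting the check by a second observer into figures is sketched below. It computes the proportion of time samples on which two observers agree, together with an index that corrects this proportion for agreement expected by chance. The chance-corrected index is Cohen's kappa, which is of later date than the studies reviewed here and is not drawn from them. The categories and records are hypothetical.

```python
from collections import Counter

# Category recorded by each observer for the same ten time samples (hypothetical).
OBSERVER_A = ["play", "play", "talk", "idle", "play", "talk", "talk", "idle", "play", "play"]
OBSERVER_B = ["play", "play", "talk", "idle", "talk", "talk", "talk", "idle", "play", "idle"]


def observer_agreement(first, second):
    """Return (proportion of agreement, chance-corrected agreement)."""
    if len(first) != len(second):
        raise ValueError("both observers must record the same time samples")
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    count_a, count_b = Counter(first), Counter(second)
    # Agreement expected if the two observers used their categories independently.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


observed, kappa = observer_agreement(OBSERVER_A, OBSERVER_B)
print("proportion of agreement:", observed)
print("chance-corrected agreement:", kappa)
```

The corrected figure is lower than the raw proportion because observers who use a few categories often will agree on some samples by chance alone.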

It will be noted that many observational studies are either delinquent 
in their consideration of the aforementioned restrictions or else they are 
frankly preliminary and exploratory in nature. Some of the studies reported 
below treat reliability adequately. Few have remarks on the statistical 
adequacy of the sample population. However, each of them has some value 
as a representative of a particular observational procedure, and it is on 
this basis primarily that selection was made. 


Sociological-Psychological Survey 


Stroup’s (119) study of the Jehovah’s Witnesses was an attempt to 
understand this movement in its historical and social context. Using origi- 
nal observations of public and private practices of the sect, and many of 
their publications, he described the history of the movement, its organiza- 
tion and hierarchy, and the literature, beliefs, attitudes, and conversion 
experiences of the Witnesses. 


Content Analyses of Documents 


Davison (31) and Dallin (29) analyzed Soviet propaganda by content 
analysis of selections from the Soviet press. Davison selected a sample 
of four Russian-controlled papers for fifteen days, tabulating the number 
of items published according to six preselected propaganda themes. He 
compared these data with the total material available in the news service 
bureaus. Dallin analyzed the increasing proportion of anti-U. S. attitudes 
manifested in Soviet “news” articles and editorials. Dollard and Mowrer 
(33) outlined a method of measuring tension in written documents. 


Analysis of Paintings 


Alschuler and Hattwick (1) obtained paintings and samples of use of 
other creative media such as crayon work, clay, blockbuilding, and dra- 
matic play for 150 children, two to four years of age, from five public 
nursery schools in Winnetka, Illinois. They studied the entire group for 
a year, and planned to follow up a number further. Analyses of these data 
were compared with expression of characteristic feelings in overt behavior. 
They found evidence of comparable forms of expression of universal ex- 
periences by children of different social backgrounds. Elkisch (36) studied 
a sample of 2200 drawings and paintings by twenty-five children. She 
also selected eight children, according to their group popularity by socio- 
metric methods, whose art products were analyzed in detail. This study 
is interesting because of the definite analytic criteria developed in terms 
of rhythm vs. rule, complexity vs. simplexity, expansion vs. compression, 
integration vs. disintegration, and realism vs. symbolism. Waehner (126) 
studied the drawings of fifty-five college students. Both spontaneous and 
assigned drawings were scored according to content and to preference for 
certain types of formal expression (size of paper, size of form elements, 
quality of lines, organization of form, shading, etc.). A psychologist, a 
Rorschach expert and teachers then matched the personality sketches, based 
on these analyses, with the drawings. A high degree of identification 
was reported. 


Factual Studies 


Morton (95) analyzed the questions regarding aviation asked by 3262 
children in Grades I to VIII, by tabulating the proportion of arithmetical 
questions concerned with number, size, and time. From this he drew con- 
clusions regarding the mathematical interests of boys and girls at various 
elementary-grade levels. Baker (5) collected 9280 questions from 1402 
children in Grades III to VI in sixteen cities. She analyzed them by content 
and considered their implications for the elementary-school curriculum. 
Masuoka (87) compared records of food purchases of 100 Hawaiian- 
Japanese families for thirty consecutive days in 1933-34 with (a) records 
of foods purchased by rural Hawaiians in 1928 and (b) data for Japanese 
in Japan. Changes in diet for the Hawaiian group were found to be in the 
direction of adoption of American foods. 


Recording of Speech 


Irwin (58) and Chen and Irwin (23) have recorded speech-sound data 
of infants using the International Phonetic Alphabet. Irwin presented sta- 
tistical evidence of the reliability of the method, particularly for the vowel 
sounds. Chen and Irwin showed that at two and one-half years the infant 
possesses nearly the full complement of adult vowel sounds but only about 
two-thirds of the consonant types. Bossard (17) made transcripts of family 
conversations at meal time for thirty-five families and supplemented these 
with interviews. From these data he developed analyses concerning (a) 
range and meaning of family vocabulary, (b) levels of language, (c) lan- 
guage as a social index of occupation, religion, and social class, (d) family 
idiosyncrasies in communication (i.e., peculiarities, private meanings, 
taboos), (e) patterns of conversation (e.g., subjective, objective) and (f) 
speech characteristics and pronunciation. Johnson and Colley (63) had 
twenty stutterers read a 1000-word, phonetically edited passage, while two 
concealed observers recorded the duration of each stuttered movement on 
a moving tape. They studied the correlation between frequency and dura- 
tion of stuttering movements. 


Case Histories: Individual Case and Exploratory Observations 


Many valuable facts, relationships, and skilled insights are frequently 
stimulated by observations made incidentally in other than purely research 
activities. The following examples were selected to illustrate the importance 
of such reports. Axline and Rogers (4) reported a detailed case history 
revealing how a skilful teacher-therapist treated a maladjusted 6-year- 
old boy. Clark and Barker (24) presented the verbal report of an intelligent 
18-year-old Negro “zoot suiter” of his participation in the Harlem 
riot of 1943. Dukes (34) described two cases to illustrate the devastating 
effect that the (wartime) disruption of family life can have on children 
and their parents. Anderson (2) cited the clinical findings of eighteen 
aphasic cases, illustrating two types, (a) in which linguistic difficulty 
interfered with the development of speech, and (b) in which disorders 
were observed producing linguistic regression. Goldfarb (48) made ex- 
tensive investigations of the life histories of fifteen adolescent children in 
institutions. His results supported his previous conclusions that infant dep- 
rivation results in a basic defect of total personality, shown especially in concept 
formation and in an attitude of passivity and emotional apathy. The reports 
of Knapp and Cambria (75) and Stevens (117) used case records as their 
source. 

Peller (101) published a list of over 100 items designed to aid teachers 
of children from two and one-half to six years in observing significant 
behavior symptoms. 


Observational Research Studies 


Muste and Sharpe (96) studied preschool children in two situations in 
order to observe aggressive behavior. Their results cover technics used 
by children responding to aggression and the relation of frequency and 
type of aggressive behavior to age, sex, and environmental background. 
Wellman and McCandless (128) used two technics, short sample records 
of child behavior and a method called “following the teacher for a full 
day,” supplemented by intelligence and vocabulary tests, in a study of 
thirty-four preschool children. Newman (97) developed technics for 
describing the behavior of up to 100 junior-high-school pupils of each sex. 
Noon-hour playground behavior was rated by three judges using a com- 
posite scale of behavior patterns, each analytically characterized by paired 
opposites and an integral scale consisting of gross behavior characteriza- 
tions of the same traits. Clubhouse behavior was rated by four judges on a 
scale consisting of twenty characteristics described in terms of the two ex- 
tremes and the middle of a seven-point scale, and twenty other traits de- 
scribed by a single adjective or phrase of each extreme. These ratings were 
supplemented by narrative and conference records on the behavior situa- 
tions. Her report covers reliability data for the ratings and examples of the 
scales used. Lannert and Ullman (78) made a valuable study of piano 
sight reading by observing nine advanced students play unfamiliar selec- 
tions. Their analysis resulted in some important conclusions concerning 
the development of skill in reading musical scores. Van Bruggen (123) 
studied factors affecting regularity of the flow of words in written compo- 
sitions by observing and recording mechanically the time required for 
writing each word, and the number and length of pauses between words, 
in three types of composition written by eighty-four pupils in Grades VII, 
VIII and IX. These observations were interpreted in relation to kind of 
composition, M.A., C.A., and scores on reading, vocabulary, and spelling 
tests, as well as personality measures. 


Analysis of Radio Programs 


Katz and Eisenberg (67) and Peatman and Hallonquist (100) reported 
studies using the Lazarsfeld-Stanton Program Analyzer Test. This method 
enables each member of an audience to indicate continuously by an indi- 
vidual signal, objectively recorded, his like or dislike for every moment 
and part of a radio program. Katz and Eisenberg found that listeners do 
not reject educational programs as such, but want them to be entertaining. 


Longitudinal Studies—Biography 


This heading implies two different approaches; yet each involves con- 
sideration of the development of the individual in the perspective of his 
life span. Gesell, Ilg, Ames, and Bullis (47) have produced an important 
longitudinal study, based on periodic clinical examination of fifty children 
from a prosperous American community, at five, five and one-half, six, 
seven, eight, and nine years. A smaller number of ten-year-olds were 
examined. The investigation covered a wide range of individual and social 
behavior, which was treated in terms of growth gradients and growth 
trends. Koshuk (76) made a preliminary report of 500 developmental 
records collected over a three-year period in two California nursery schools 
attended by children of employed mothers. The records included a pre- 
entrance interview with the mother, observational notes and semester reports 
by teachers, and an interview with the mother when the child was with- 
drawn from the nursery school. Her analysis was concerned primarily with 
changes in home behavior and social adjustment. 

Hill and Ackiss (53) suggested the use of the life history method, based 
on detailed, intimate, insight interviews, as a method of understanding 
the attitudes, feelings, and beliefs of special groups, and obtaining a more 
complete understanding of the dynamics of race relations in our society. 
They presented three condensed life histories together with their analysis 
of each in reference to the problems of an all-Negro community. They be- 
lieve that this methodology especially lends itself to the study of race rela- 
tions, since it focuses attention on the irrational, emotional, and psychic 
quality of racial attitudes; that it bridges the gap between the factual com- 
munity survey and the personality-culture study. 

Kelley (70) discussed the autobiography as a useful diagnostic and 
therapeutic tool in psychiatry. It is a source of information concerning 
the individual; it has diagnostic value with reference to the quantity and 
form of the patient’s writing; it focuses on the patient’s problems; it is an 
instrument of catharsis and helps the individual know that he is helping 
himself. 


Ratings, Rating Judgments, and Rating Scales 


While direct observation frequently involves judgments by the observer, 
these judgments are primarily descriptive and classificatory rather than 
evaluative. Rating judgments involve observation as a source of informa- 
tion, but the emphasis is on evaluation. A critical appraisal of rating 
judgments as an instrument of research must recognize the requirements 
of reliability and sampling. These two problems are interrelated in ratings, 
inasmuch as one source of unreliability of ratings is the peculiar set of 
biases occurring between samples. These include (a) differences in sample 
composition in attributes of members, (b) differences in standards or 
criteria employed by different raters in relation to different samples, (c) 
different biases and halos affecting raters with regard to different samples. 
The validity of ratings, however, is the most critical problem. This is the 
question of identification of the attribute or function rated with the true 
representation of the attribute or function, and is usually demonstrated by 
correlating the ratings with a suitable criterion. 
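
The correlation of ratings with a criterion, mentioned above as the usual demonstration of validity, can be illustrated with a minimal sketch. The function name `pearson_r` and the two lists of figures are hypothetical and serve only to show the arithmetic; they are not drawn from any study reviewed here.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares about the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when either variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical illustration: ratings on a five-point scale and a criterion score
ratings = [2, 3, 3, 4, 5]
criterion = [41, 50, 47, 58, 66]
validity_coefficient = pearson_r(ratings, criterion)
print(round(validity_coefficient, 2))
```

A coefficient computed in this way is only as meaningful as the criterion chosen, which is the point made in the paragraph above.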


Assessment of Men 


The psychological division of the Office of Strategic Services, under 
the leadership of Henry A. Murray, developed valuable technics of as- 
sessment by which personnel were evaluated and selected for wartime 
assignments of highest requirements and responsibility. One of the sig- 
nificant aspects of the assessment program was the emphasis on an inter- 
pretive synthesis of the whole person, in contrast to the elementaristic 
approach so much in vogue. MacKinnon’s report (82) describes the method. 


Rating Scales 


Metfessel (91) proposed that a scale of cardinal numbers be used in 
expressing comparative judgments so that the subject either actually or 
symbolically manipulates units of the ratio scale of cardinal numbers in 
expressing his judgments of quantitative relations among the items on a 
given dimension. He claimed that this method causes comparative judgments 
to be made with greater sensitivity than is the case with ordinal scales, 
and that individual differences are likely to be given more consideration 
by judges using this method. 

Weinland (127) outlined a technic for improving the reliability of 
rating scales by obtaining descriptive words for rating scales, which will 
be in the workers’ language, and will be conducive to using the whole scale 
without overloading at the average. The method involves first writing a 
list of descriptive terms for a particular characteristic, then evaluating each 
word by placing it on a nine-point scale, and finally selecting the best 
words for a five-point scale at the approximate relative positions of two, 
four, six, and eight. Kelly (71) presented a useful technical report on the 
development of a scale for rating pilot competency. He described the 
method of developing the scale and the data on item interrelationships. 
The preliminary “man-to-man” graphic scale covering skill, emotional 
stability, and judgment was discarded because of high intercorrelations 
between traits, indicating lack of independence. The final scale combined 
the best points of several other attempts into a fourteen-item graphic scale. 
A factor analysis of item intercorrelations of this scale showed that three 
factors, (a) skill, (b) judgment, and (c) emotional control, account 
substantially for all of the variance. 
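
The word-selection procedure attributed to Weinland above can be sketched as follows. This is an illustration under stated assumptions, not Weinland's own computation: it assumes that each candidate word receives several placements on the nine-point scale, that the mean placement is taken as the word's scale value, and that the word nearest each target position is selected. The words, placements, and the function name `pick_anchor_words` are hypothetical.

```python
from statistics import mean


def pick_anchor_words(placements, targets=(2, 4, 6, 8)):
    """Choose, for each target position, the unused word whose mean placement is nearest.

    placements maps each candidate word to the list of positions (1 to 9)
    assigned to it by the judges.
    """
    if len(placements) < len(targets):
        raise ValueError("need at least as many candidate words as target positions")
    scale_values = {word: mean(values) for word, values in placements.items()}
    chosen = {}
    for target in targets:
        # A word already assigned to an earlier target is not reused
        available = [w for w in scale_values if w not in chosen.values()]
        chosen[target] = min(available, key=lambda w: abs(scale_values[w] - target))
    return chosen


# Hypothetical placements by three judges
example = {
    "poor": [1, 2, 2],
    "fair": [4, 4, 3],
    "average": [5, 5, 5],
    "good": [6, 7, 6],
    "excellent": [8, 9, 8],
}
print(pick_anchor_words(example))
```

The target positions of two, four, six, and eight are those given in the text; a different set of positions can be passed through the `targets` argument.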


Rating of Applicants for Positions 


McCoy’s (88) comprehensive report described trends and research in 
evaluation and rating of Federal Civil Service unassembled examinations. 
Job analysis precedes and forms a basis for examining technic, which con- 
siders applicant’s training and experience records, supplemented by various 
documentary and interview evidence. Examiners evaluate applications 
numerically on rating schedules. The current emphasis is on quality rather 
than quantity of qualifications. A refinement of procedure was the establish- 
ment of rating factors, based on the specific skills or abilities needed. 
Quality ratings on a graduated scale were assigned to applicants on rating 
factors established on the rating schedule for each position. Brody (20) 
described a method he developed for judging candidates by observing them 
in unsupervised group discussion. Ten to twelve candidates are observed 
at a time, and each may be asked to speak for three to five minutes on 
some selected topic, following which all participate in group discussion. 
The employer-interviewer is free from participation and may devote all 
his attention to observing and rating candidates. Brody cites how the 
New York City Department of Health has used this group method, judging 
candidates for appearance, manner of speech, attitude toward the group, 
leadership, contribution to group performance, and scientific approach. 


Rating of Teachers 


Seagoe (114) found that the success of twenty-five teachers, after two 
years of service, measured by ratings by school administrators directly 
over them, was correlated with test scores on personality, emphasizing 
mental health, obtained while they were in training. Prognostic tests 
of teaching ability given at the same time showed no predictive ability. 
She also found relatively high correlations with the criterion for training 
teachers’ judgments of teaching success, but not for grade-point ratio. 
Von Haden (125) used a composite of five supervisory ratings, pupil evalu- 
ations, and pupil gains as criteria, and interviews, digests of interviews, 
autobiographies, and instructors’ and supervisors’ ratings during teacher 
training as predictors, in a study of fifty-eight women teachers in first- 
year positions. He found that subjective personal data correlate most 
highly with supervisory ratings of teaching success, but are not closely asso- 
ciated with objective measures of teaching effectiveness as indicated by 
pupil evaluations and pupil gains. Barker (7) used personal interviews and 
case studies, as well as ratings by superiors in a study of personality ad- 
justment of teachers. School principals in consultation with their general 
supervisor selected three teachers from each of twenty schools, on the basis 
of (a) the best teacher, (b) an average teacher, and (c) a below-average 
teacher. Her findings showed a marked pattern of relationship of 
adjustment measures to efficiency in teaching. 


Merit Ratings of Personnel 


Probst (106) described his Probst Rating System with examples for a 
variety of occupations, together with technical treatment of item selection, 
foreign translations, oral interviews, and experimental procedures. Ryan 
(111) argued that the logical difficulties and practical inadequacies of 
graphic rating scales with numerical scoring are bad enough to offset any 
advantages claimed for them, and suggested a simple scale for merit rating 
until more adequate means of measuring an employee’s value can be 
found. This is a three-point scale ranging from outstanding thru average 
to poor, restricting extremes to the upper and lower 10 percent. This type 
of scale employs judgments which can be made with some confidence, altho 
it has compensating shortcomings in the massing of the middle 80 percent. 
It can be refined somewhat by requiring separate judgments on special 
qualifications and abilities, such as dependability, specialized technical 
knowledge, ability to instruct others, and ability to get along with others. 
Mahler (84) prepared an annotated bibliography covering employee rating 
methods, administration of merit-rating programs, types of rating methods, 
and research and reports on the use of merit ratings. In another paper 
Mahler (83) reported a survey of current practices in employee rating in 
125 companies, in which most of them use a scale method. A total of 131 
different traits were listed on these scales, which generally were individual 
scales using from one to thirty-three traits which were often poorly defined 
and overlapping. The number of degrees for rating each trait ranged from 
three to sixteen. Ferguson (39) developed an employee merit-rating system 
for a large company. This system was especially designed to measure the 
factors having a direct bearing on the success or failure of a group of 
workers on identical jobs. He described the development of the system and 
the experimental work of validation. 
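
The three-point scale suggested by Ryan, with the extremes restricted to the upper and lower 10 percent, can be illustrated by a short sketch. It assumes that the rater has already ranked the employees from best to poorest and that the size of each extreme group is found by rounding down; both assumptions, along with the function name `three_point_ratings`, are made here for illustration and are not taken from Ryan's paper.

```python
def three_point_ratings(ranked_names, tail_share=0.10):
    """Assign 'outstanding', 'average', or 'poor' from a best-to-poorest ranking."""
    if not 0 <= tail_share <= 0.5:
        raise ValueError("tail_share must lie between 0 and 0.5")
    n = len(ranked_names)
    k = int(n * tail_share)  # size of each extreme group, rounded down (assumption)
    ratings = {}
    for position, name in enumerate(ranked_names):
        if position < k:
            ratings[name] = "outstanding"
        elif position >= n - k:
            ratings[name] = "poor"
        else:
            ratings[name] = "average"
    return ratings


# Hypothetical group of twenty employees, already ranked by the rater
names = [f"E{i:02d}" for i in range(1, 21)]
ratings = three_point_ratings(names)
print(list(ratings.values()).count("average"))  # size of the middle group
```

With twenty employees the sketch places two in each extreme and sixteen in the middle, which shows the massing of the middle 80 percent noted above.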


Ratings of Courses by Students 


Barkley (8), Fowler (41), Marsh (85), Savage (113), and Taylor (122) 
investigated the appraisal of courses by students. In general, favorable rat- 
ings predominate. These ratings seem to be unrelated to scholastic aptitude, 
altho Savage found a tendency toward less approval associated with low 
grades. Fowler reported that students generally rated their instructors 
higher than the instructors rated themselves. Both students and instructors 
believed the ratings “served a good purpose.” Taylor and his associates 
discussed arguments for and against such ratings which they believe valu- 
able for the teacher and the administration but reported that the majority 
of the Smith College faculty recently voted adversely on this issue. 


Ratings of Social and Personal Factors 


Anderson (3) had the male and female heads of 344 New York farm 
families rate their own families on a four-point scale with respect to two 
indicators of social status and three indicators of social participation. 
Ratings of status were higher than those of participation. Analysis of 
ratings in relation to objective measures of social status and social par- 
ticipation showed that most participation and leadership activities are not 
only carried out on the basis of social standing in the community, eco- 
nomic position, tenure status, and family maturity, but also as an expres- 
sion of the opinion of the family about its own social position. Remmers 
and Kerr (109) measured cultural, aesthetic, and economic aspects of the 
home environment of 16,445 eighth-grade children in forty-two cities 
in twenty states by means of the American Home Scale. The cities chosen 
were approximately equally spaced in Thorndike’s G scale. Correlations 
of city averages with Thorndike’s G (goodness of living), I (income), and 
P (personality) indexes are low and not significantly different from zero. 
The authors believe that the American Home Scale has higher face validity 
and measures more directly and validly goodness of living, functional in- 
come, and personal factors than do the Thorndike scales. The treatment of 
validity in this paper is of interest inasmuch as there is no independent 
criterion by which to judge the conclusion. Kaufman (68) studied biases 
in ratings in an interesting fashion. He had fourteen individuals judge 
the prestige rank of members of a New York rural community and studied 
the judgments with respect to agreement with judgments of other raters, 
objectivity and discrimination in judgment, and biases evident in judg- 
ments. Results showed that despite high correlations of individual with 
composite group judgments, individual objectivity (ability to distinguish 
between personal likes and dislikes and the way in which a person is re- 
garded in the community) varied considerably, and to some extent the 
judge’s objectivity seemed to be influenced by his own prestige rank. 
Individual bias also varied considerably, apparently heavily determined 
by the judge’s attitude toward the social group with which the person rated 
was affiliated. Goodenough (49) had ten men and ten women judge the 
sex of 115 high-school students from specimens of their handwriting and 
report degree of confidence of the judgment on a three-point scale. Identi- 
fication judgments were correct in about two-thirds of the cases, with no 
sex difference in success of judgment. However, all women judges’ confi- 
dence judgments exceeded the median for the men, and a positive correla- 
tion was found for both sexes between accuracy of judgment and confi- 
dence. Rokeach (110) studied factors affecting judgments of beauty by 
college women. Using five separate groups, he had each subject rate her- 
self and each other member of her group on beauty. He obtained three 
scores for each person: (a) average rating by all others in group, (b) self 
rating, (c) average rating attributed to others. The results indicate in part 
that persons below the average of the group tend to overestimate their own 
beauty, that those who have insight into their own possession of beauty tend 
to be objective in rating others, and those who have a high degree of 
beauty tend to be objective in rating others regardless of degree of insight. 
Mitchell (93) presented two high-school classes numbering seventy-five 
with a schedule entitled “Types of Persons I Have Met.” This list contained 
forty-four classifications, such as Alibi-Ike, Goody-Goody, Honest Type, 
Pussy-Foot, and Tough. Each student was asked to write in each category 
the name of a classmate who best fitted it. He found considerable agree- 
ment on judgments in both classes which he interpreted in terms of the 
tendency for a person’s social stimulus value to be consistent. Jacobson (61) 
divided a freshman class of 285 women into eleven groups and had each 
member evaluate her group mates on appearance and behavior. She an- 
alyzed 9076 responses into forty-nine subtopics and classified these into 
five categories. The result was that responses of a psychological nature 
were most frequent, with grooming second, physical characteristics third, 
clothing fourth, and intelligence fifth. Favorable evaluations were more fre- 
quent than unfavorable or neutral ones. 


Sociometric Ratings 


Sociometry is not a new method. However, the wide applicability and 
adoption of sociometric measures in the study of problems of group dy- 
namics is one of the conspicuous occurrences of the current period. Moreno 
(94), its originator, discussed the contributions of sociometry to research 
methodology in sociology in the study of intergroup relations, as distin- 
guished from its twin study of interpersonal relations. The following papers 
by Potashin (104), Olson (99), Bonney (15) (16), Blanchard (12), and 
Jacobs (60) are concerned with studies of interpersonal relations. In 
her study of children’s friendships, Potashin defined friends as pairs of 
children in which each gives to the other the highest choice in a sociometric 
test in a classroom, while nonfriends are pairs in which one gives the 
other his highest choice, but the latter does not reciprocate. She compared 
friends with nonfriends and found that sociological factors are of little 
significance in determining friendships, while physical or intellectual simi- 
larity are even less significant; a child who is one of a pair of friends is 
usually well accepted by the group while a nonfriend is not so well accepted. 
Friends also seemed to participate in group activity more and to require 
less adult direction. Olson reported a highly interesting experiment in 
which a sociometric analysis of a third-grade class, supplemented by stud- 
ies of family and community relations, was the basis for a successful effort 
to improve human relations among the children in the classroom. The 
study also gave rise to a conclusion that children’s social relations in a 
classroom have deep roots in community and family living as well as in the 
physical, mental, and emotional differences among the children. Bonney 
asked 100 sixth-grade children to indicate individually with whom they 
played most often, and those whom they would prefer to have on their 
side for a quiz-kid program. He studied the amount of reciprocating 
between those chosen rarely and those chosen frequently. The tendency of 
the low group to choose the high group more often was more pronounced in 
choice of quiz-kid teammate than in choice of playmate. Jacobs asked 
seventeen girls in the same office, individually, to name those with whom 
they would prefer to work in close proximity. The tabulated results re- 
vealed certain attractions and repulsions among the girls which were not 
known to the management of the firm, but which nevertheless probably af- 
fected office morale and productivity. 
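
Potashin's definition of friends and nonfriends, given earlier in this section, lends itself to a brief sketch. It assumes that each child's highest choice has been recorded as a single name; the children's names and the function name `classify_pairs` are hypothetical, and the sketch is not a reproduction of her procedure.

```python
def classify_pairs(top_choice):
    """Split sociometric first choices into mutual pairs and one-sided choices.

    top_choice maps each child to the classmate given his or her highest choice.
    Returns (friends, nonfriends): friends is a set of unordered pairs in which
    each names the other; nonfriends is a set of (chooser, chosen) tuples in
    which the choice is not returned.
    """
    friends = set()
    nonfriends = set()
    for chooser, chosen in top_choice.items():
        if chooser == chosen:
            continue  # a self-choice is not a pair
        if top_choice.get(chosen) == chooser:
            friends.add(frozenset((chooser, chosen)))
        else:
            nonfriends.add((chooser, chosen))
    return friends, nonfriends


# Hypothetical classroom of four children
choices = {"Ann": "Bea", "Bea": "Ann", "Cal": "Ann", "Dot": "Cal"}
friends, nonfriends = classify_pairs(choices)
print(friends)
print(nonfriends)
```

In this illustration Ann and Bea form a pair of friends, while the choices made by Cal and Dot are not returned.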


Interviews 


Perhaps the most ambitious and scientifically rigorous investigation of 
human behavior ever undertaken using the interview as a method of data 
collection is the so-called Kinsey (72) survey of sex behavior. At the time 
of this report, 10,500 case histories were completed, based on first-hand 
interviews with persons “of wide social range, of all ages, and diversity 
of educational, occupational, religious, and rural-urban backgrounds.” 

The interview is a basic method of research, as above, as well as of 
evaluation, guidance, and therapy. The following group of studies illus- 
trates methodological aspects of the interview method. 


Projective Technic in Interviewing 


Pepinsky (102) described how he used a lithograph of a bleak landscape 
on the wall in his office in the manner of a Rorschach card. As a 
counselor, he directed the client’s attention to the picture during the 
course of a counseling interview, without making the client suspicious or 
hostile. He claimed that this technic provides a release of tension for the 
subject and a possibility of diagnosis by uncovering hidden motives. 
Rautman and Brower (108) used ten pictures from the Thematic Apper- 
ception Test in interviewing 536 public-school pupils in Grades III to VI 
and their twenty teachers in December 1943. Each pupil or teacher wrote 
a brief story about each picture and answered the following questions: 
(a) What is happening? (b) How are they feeling? (c) How will it end? 
The resulting data were examined for evidence of war-inspired themes. 
Symonds (121) used forty-two pictures, especially drawn for the study, 
each containing at least one adolescent figure with which an adolescent 
might identify himself. He secured stories on each picture from twenty 
normal boys and twenty normal girls in junior- and senior-high school and 
tabulated the number of themes presented. In this preliminary report he 
stated that the count of themes presented should serve as normative data 
against which to compare fantasy productions of individuals and groups. 


Interview in Casework 


Voiland, Gundelach, and Corner (124) formulated a casework approach 
intended to assist the caseworker in the diagnosis and treatment of people 
and their problems from the very beginning of casework. They stressed 
the importance of developing rapport in the first interview and the need 
for meeting the client on his own ground. Kay and Schick (69) described 
the “role-practice” method of training interviewers in developing skill and 
insight. A study of social conflict in a community of mixed ethnic groups 
required that interviewers obtain information with regard to informants’ 
intimate, private attitudes. Interviewers were trained by practicing, in suc- 
cession, each of several roles representing different approaches of the inter- 
viewer, i.e., friendly, inquirer, stranger, analyst. 


Interview in Evaluation 


Newman, Bobbitt, and Cameron (98) reported reliabilities of from .83 
to .85 in independent interview ratings by two psychologists and a psy- 
chiatrist with 399 Coast Guard Reserve officer candidates and 137 SPAR 
officer candidates. The bases of evaluation were complex and covered (a) 
ability to pass academic training, (b) ability to withstand psychological 
pressures and tensions of the training program, and (c) ability to with- 
stand the trauma of combat and demands of service life. Ratings were 
on a thirteen-point scale. The report covers factors influencing reliability. 


Patterned Interview 


McMurry (89) cited several studies which revealed the extent to which 
interviewers using the patterned interview were able to predict ultimate 
job success. The patterned interview usually is a printed form containing 
specific items to be covered and providing a uniform method of recording 
information and rating the interviewee’s qualifications. The principal 
advantage of the patterned interview is that it eliminates much of the guess- 
work and provides greater uniformity of results. 


Interviews in Research 


Huang and Lee (56) used interviews in an experimental study of 
child animism. They asked twenty children from six years to eight 
years seven months questions about a dog, tree, river, stone, pencil, 
bicycle, ball, automobile, watch, and the moon. The children indicated 
whether they believed the object was living, had life, felt pain when 
pricked, was capable of wanting, could be good, had anything it must 
do (function), and performed this purposefully; then they amplified this 
statement. Results were analyzed in terms of animistic beliefs and in 
relation to age. Duvall (35) interviewed ten fiancées and sixty-seven 
wives of servicemen to study the effect of wartime separation on women. 
He found that loneliness was mentioned most and was a function of the 
degree of social participation, the more active wives being less lonely. 
Witty, Coomer, and McBean (130) secured interviews with children 
in kindergarten and the first three grades, and questionnaires for children 
in Grades IV thru VIII, totaling 7879, to ascertain their favorite books 
or stories. The children’s selections agreed closely with lists of books 
selected by adults. Maturation of interests was observed from grade to 
grade. The report lists preferred books, with rank awarded to each, for 
kindergarten, primary, intermediate, and upper grades. Jones and Arrington 
(64) made a qualitative and quantitative study of the explanations 
of physical phenomena given by 161 Negro and 134 white children in 
Grades III to VIII to find whether there is justification for a belief that 
Negroes are more superstitious than whites. The results did not support this 
hypothesis; very few answers in either group indicated belief in either 
the supernatural or in superstitions. 


Questionnaires 


The questionnaire is undoubtedly the most frequently used of all 
observational technics, having obvious advantages in administrative 
economy of time, money, and effort over other methods. However, savings 
accomplished by this method are frequently made at the sacrifice of sin- 
cerity, completeness, reliability of response, and validity of interpretation. 
Technics of questionnaire construction and standardization are well 
established. It appears that they are too often disregarded, however, 
because they involve expenditure of time, effort, and skill which offsets 
the savings accomplished in administration. Unfortunately many studies are 
published which use questionnaires uncritically and without adequate 
preparation. 


Studies of Questionnaire Technic 


Marsh (86) demonstrated the influence of set or attitude created thru 
verbal instructions upon responses to questionnaires. He gave two groups 
of college students a fifty-seven item form entitled “Campus Issues In- 
quiry.” One group was instructed to judge the items impersonally, while 
the other was told to judge the items in terms of their own personal 
experience. Significant differences were found for items considered 
critical issues. Gerberich (45) made a study of consistency of responses to 
questions, using sixty questions, some dealing with fact and others with 
personal and social adjustment, selected from professional application 
blanks and attitude and personal adjustment questionnaires. These were 
interspersed with others in three ostensibly different questionnaires and 
administered on three successive occasions to 657 college students who 
were not told that the selected questions were repeated. The consistency 
of each person for each question was analyzed. For one-day intervals it 
was 76 percent and for ten-day intervals 74 percent; the percent was 
slightly higher for women than for men, and lower for factual questions 
than for attitude-adjustment questions. 
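
A percent of consistency of the kind reported by Gerberich can be computed as the share of repeated questions answered identically on two occasions. The sketch below makes that assumption; the answers and the function name `percent_consistent` are hypothetical, and Gerberich's own scoring rules may have differed.

```python
def percent_consistent(first, second):
    """Percent of repeated questions given the same answer on both occasions."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty answer lists of equal length")
    same = sum(1 for a, b in zip(first, second) if a == b)
    return 100.0 * same / len(first)


# Hypothetical answers of one student to four repeated questions
first_occasion = ["yes", "no", "yes", "yes"]
second_occasion = ["yes", "no", "no", "yes"]
print(percent_consistent(first_occasion, second_occasion))  # prints 75.0
```

The same function can be applied to the pooled answers of a group, or separately to factual and attitude questions, to obtain comparisons like those described above.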


Questionnaire Investigations 


The following studies demonstrate types of problems which have been 
investigated successfully using questionnaires. Baker and Peatman (6), 
covering a full sample of Veterans Administration advisement units in 
operation on July 1, 1946, sent the director of each unit a questionnaire 
asking him to indicate the tests which he found most useful in his unit. 
This type of investigation, requesting concrete information, and covering 
a complete sample of qualified respondents represents an optimal use 
of questionnaire technic. 

Cupps and Hayner (28) gave a sample of 182 men and 132 women 
students at the University of Washington a three-page questionnaire in- 
quiring into their experiences and feelings about dating. Kirkpatrick and 
Caplow (73) at the University of Minnesota obtained information from 
students on courtship problems. Brav (19) had fifty women, married from 
two to twenty-seven years (average twelve and one-half) in a small 
southern community, fill out a detailed intimate questionnaire giving 
their views on the honeymoon, based on their own experience. As in 
these three studies the questionnaire is often a good vehicle for obtaining 
intimate, personal information, provided respondents are properly 
motivated to answer frankly, their identity is adequately concealed, 
preferably by keeping replies anonymous, and the information requested 
has not been distorted by forgetting. 

Wittenborn, Larsen, and Mogil (129) used two questionnaires regard- 
ing study habits, one in French and one in Spanish. Each included items 
relating to personal and emotional reactions, knowledge of subjectmatter, 
skill, and study technics. They gave these to college students and correlated 
the responses with objective indexes of academic performance. It was 
found that a large and useful group of study habits is significantly re- 
lated to the criteria. Stoughton and Ray (118) had 344 children in Grades 
II, IV, and VI answer the following questions: “Of all the persons whom 
you have known, or heard about, or read about, whom would you most like 
to be like? Why do you like this person?” They used a questionnaire form 
in the two higher grades. Responses, segregated by grade and sex, were 
tabulated under three broad categories of persons or characters: (a) those 
in the child’s immediate, everyday environment, (b) those in remote 
places, including imaginative and fictional, and (c) those in religion. 
Horrocks and Thompson (55) studied the degree of friendship fluctua- 
tion in adolescent friendships in a rural group of 421 boys and 484 girls 
from ten to seventeen years. On each of two occasions, fourteen days apart, 
each one was asked to list his or her three best friends. An index of friend- 
ship fluctuation was computed and data were analyzed by sex and age. 
Harris and Watson (52) attempted to answer the question “Are Jewish 
or Gentile children more clannish?” thru a questionnaire to eighty-two 
children in grades four to six of an upper-class private school in New 
York. Each child was required to list his three best friends under three 
conditions: (a) choice limited to members of his class, (b) choice limited 
to children attending the school, and (c) choice limited to children out- 
side of school. Altho the sample was highly restricted, the design of this 
study is interesting. The results were analyzed directly by comparison 
of percent of in-group and out-group selections under each category of 
response. This treatment permitted the conclusion that Gentile children 
were less likely to choose Jewish friends than were Jewish children to 
choose Gentile friends. 

Pratt (105) and Zeligs (131) studied children’s fears and annoyances 
respectively using the open-question group-response technic. Pratt had 
a group of rural children from four years to fifteen years ten months, 
list their current fears, indicating three things feared most and three 
feared least. Responses were analyzed by grade and sex. Evidence was 
found of cultural stereotypes, such as fears of lions or tigers. Zeligs had 
sixth-grade boys and girls of two suburban Cincinnati schools list things 
which annoyed and irritated them. She then classified the responses into 
twelve categories, ranging from social relationships to environmental 
conditions, and incorporated them in a rating scale with response cate- 
gories from like to hate much. The following year she presented this rating 
scale to 285 of the same children and tabulated their response by cate- 
gories. Clear patterns of sex differences were found. 


Mass Surveys 


Mass surveys of large samples of respondents are recognized today as big 
business, as well as by big business. They include market surveys, attitude and 
opinion polls; they use various types of questionnaires and interviews; 
and collect data by personal interview, by mail, and by various combina- 
tions of these. Some interesting developments were sponsored by govern- 
ment agencies during the war such as a consumer panel, in which a con- 
tinuous sample of respondents furnished daily information on purchases 
of rationed foods, to assist the rationing agency in setting point values. 
Consumer panels are used to study brand preferences, readership interests, 
or any other specific problems which a sponsor is willing to underwrite. 

All these procedures have certain things in common. They secure in- 
formation from people, using sampling technics, with relatively large 
numbers of respondents, and attempt to develop generalizations from the 
samples concerning the whole or a substantial segment of the population. 
Notwithstanding these similarities, however, there are wide differences in 
technic of securing information and in sampling required for different 
problems in this general field. These problems of method have been widely 
discussed in the literature. 


General Discussions 


One of the most comprehensive discussions is McNemar’s (90) critical 
review of opinion-attitude methodology, in which he carefully analyzed 

December 1948 OBSERVATIONAL METHODS OF RESEARCH 

the issues involved in measuring attitudes by scaling technics, in single- 
question opinion gauging, in the administration of tests, in statistical 
treatment and sampling, and in conducting morale surveys. Katz (66) 
wrote a critical comment distinguishing what he called “survey technic” 
from opinion polling. He holds that the single-question type opinion 
poll is applicable only on issues where public opinion is crystallized and 
to which unambiguous questions can be addressed. In such cases, a “cross- 
section” sample may serve adequately. The “survey technic,” on the other 
hand, is necessary in approaching issues in which public opinion is not 
crystallized and requires the use of plural devices, subtle and often indirect 
indexes, more complex sampling and above all careful consideration of 
scientific method in the analysis of such variables. The “survey technic” 
was applied by government agencies during the war to the measurement 
of morale in war industries, and in studying morale as influenced by 
strategic bombing in Germany and in Japan. Campbell (22), in a com- 
panion article, emphasized the limitation of polling to questions that are 
well understood by the public and are clear-cut. For the remaining ques- 
tions he recommended “open interviewing,” which reveals not only the 
respondent’s yes or no answer, but his interpretation of the question and 
the intensity of his feeling about it. Bernays (9), addressing himself only 
to attitude polls and excluding factual and purely quantitative surveys 
on markets, elections, and similar issues, argued that polls are useful when 
used as a guide to current opinion, since they cover only a temporary 
attitude. He was concerned with the prevention of misuse and misinter- 
pretation and recommended licensing sound and ethical polling agencies 
and educating the people and public leaders in the social significance of 
polls. Blankenship (13) edited a symposium to which twenty-nine special- 
ists contributed articles dealing with specific aspects of consumer and 
opinion research and sampling technic as practiced in business, industry, 
government, and agencies reporting to the public. Ferraby (40) discussed
the qualitative and quantitative technics of mass-observation in
England.


Technical Papers 


Slight differences in phrasing, order of questions, and like variations
often yield different proportions of favorable and unfavorable responses.
Suchman and Guttman (120) described an objective method, based on
Guttman’s technics of scale analysis and intensity analysis, for dividing the
responses into pros and cons so that the division is not affected by wording
or similar variation. The method results in an intensity curve, the lowest
point of which will be the same for any sample of questions from the
same scale or dimension of opinion. In another paper, Guttman (51)
described the technic of scalogram analysis, a method of quantifying
qualitative data.
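
The notion of a scale in Guttman's sense can be illustrated by a small computation. The sketch below is not taken from the papers reviewed. It assumes yes-no items and one simple error-counting convention (deviations from the response pattern implied by each respondent's total score), and it computes a coefficient of reproducibility. The function name and the sample data are hypothetical.

```python
def reproducibility(responses):
    """Coefficient of reproducibility for rows of 0/1 answers (1 = favorable).

    Assumed convention: an error is any cell that departs from the ideal
    scale pattern implied by the respondent's total score.
    """
    n_items = len(responses[0])
    # Order items from most to least frequently endorsed.
    totals = [sum(row[j] for row in responses) for j in range(n_items)]
    order = sorted(range(n_items), key=lambda j: -totals[j])
    errors = 0
    for row in responses:
        score = sum(row)
        # Ideal pattern: the `score` most popular items endorsed, the rest not.
        for rank, j in enumerate(order):
            expected = 1 if rank < score else 0
            errors += int(row[j] != expected)
    return 1 - errors / (len(responses) * n_items)


# Hypothetical data: a perfectly scalable set, then the same set plus one deviant respondent.
perfect = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
mixed = perfect + [[0, 0, 1]]
rep_perfect = reproducibility(perfect)
rep_mixed = reproducibility(mixed)
print(rep_perfect, round(rep_mixed, 3))
```

A value near 1 suggests that the items order respondents along a single dimension. Other conventions for counting errors are in use and give somewhat different values.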


Gallup (43), in reply to criticism of public-opinion measuring,
described a system of question design which has been evolved out of many
years of research by the American Institute of Public Opinion. This 
method provides an opportunity for probing five aspects of opinion: 
(a) filter and information questions to find out if the respondent has 
given any thought or attention to the issue, (b) open or free answer 
questions to get at unstructured opinions and reveal the direction of the 
respondent's thinking, (c) dichotomous or specific issue questions in which 
the public is asked to stand up and be counted on specific issues, 
(d) questions asking respondents why they hold the opinions they do, 
and (e) questions to measure the intensity of opinions. It is apparent 
that most critics have viewed the pollers as practitioners of the dichotomous 
type of question more or less exclusively. 

Campbell (21), in discussing the open-question technic, pointed out 
that open questions are useful in revealing general attitudes and sug- 
gestions about the sources of opinions, but they permit variation in frame 
of reference and thus allow avoidance of the desired response or make 
difficult the classification of responses. The questioner must be alert to 
follow up with added questions to clarify responses. 

Connelly (25) pointed out that adequacy of sampling procedures and 
reliability of results are subordinate to the major problem of validity, 
which is adversely affected by (a) interpreting a specific response apart 
from its specific stimulus, (b) failure to write poll questions in terms of 
the objective behavior in which the poller is interested, (c) believing 
that the respondents must think clearly (according to the poll author’s 
concept of clear thinking), (d) assuming that the respondents need to 
know the implications of their opinions to answer validly, and (e) in- 
corporating prestige factors into the complex of questions. 

Dodd (32), recognizing the growth of the survey industry, has pointed
out the need for, and outlined, some forty dimensions of excellence as a
basis for standards for surveying agencies. These are divided into six
major groups: agency credence standards, questionnaire standards, sam- 
pling standards (with some statistical details given), interviewing stand- 
ards, reporting standards, and administrative standards. These are un- 
questionably worthy of any research agency. 

The questions of administration of field work in large-scale surveys 
have been discussed by Bevis (10) and by Huey (57). Bevis covered 
training and supervision of resident interviewers. Huey outlined the 
special problems and special skills arising in large-scale, organized re- 
search. He described the organization and administration of the Morale 
Division of the U. S. Strategic Bombing Survey in Japan. 

Franzen and Lazarsfeld (42) analyzed certain problems in the use of
the mail questionnaire. They mailed questionnaires to a sample of 3000 
Time subscribers, 1000 each of three slightly different forms. Five hundred 
five respondents and 882 nonrepliers were personally interviewed. The 
results are stated to indicate that mail questionnaires can produce 
valid samples of comparatively homogeneous groups and that the answers
to some questions given in a mail questionnaire are more informative and
more freely given than the answers the same people give to the same
question when face to face with an interviewer. They recognized, however, 
that significant biases do occur in mail returns, and precautions must be 
taken to avoid prejudiced generalization. 

Guest (50) compared the results of a consumer jury test carried out 
by magazine votes with one carried out by personal interview. He found 
that the difference was considerable tho not statistically significant. 


Illustrative Studies 





The following references illustrate some of the major categories of
mass-survey technic described above. Reader panel: Corson (26); opinion
study of a special population segment: Hollis (54), Peterson (103),
Faterson and Klopfer (38); information survey: Mills and Atkinson (92),
Endicott (37), Kwoh (77); attitude survey: Blum (14), Crespi (27),
Smith (116), Samuelson (112).


Combinations of Observational Technics 


A number of studies have combined several different technics in a manner
which justifies special mention. Kitay’s (74) investigation of attitude 
toward religion made extensive use of personal documents, such as auto- 
biographies, essays by students on their attitudes, responses to personal 
data questionnaires and attitude questionnaires. His study was based on 
139 Jewish students of the Commerce Center of the College of the City of 
New York in 1941 and 1942. Radke (107) investigated parental authority 
and discipline in the home environment and its correlates in the child’s 
attitudes and social behavior by a number of methods, including question- 
naire and interview reports from both parents and children, ratings of 
personal-social behavior by teachers, and tests, experimental situations, 
and projective technics on the children. The subjects were forty-three 
nursery school and kindergarten children, averaging four years eight 
months in age, from urban homes, representing a select social, economic, 
and educational sample of the population. Radke found that the pro- 
jective interviews with the young children yielded valuable data in locating 
critical areas in the child’s home relations and in getting his reactions to 
known home situations. 


Observational Instruments and Aids 


Photography 


An outstanding photographic study is that of Gesell (46), which
summarizes the physical and social growth of a baby in 800 photographs.


Eye Movements


Brandt (18) developed a camera which permits simultaneous recording 
of vertical and lateral eye movements. This is a valuable device for many
types of research in reading, attention, illumination fatigue, and related
problems. 


Hearing Aids 


Davis (30) reported experimental findings and acoustic recommendations
on the design objectives of an ideal hearing aid. Gates and Kushner
(44) conducted research of extreme importance and interest on the factors
influencing the decision of children to wear hearing aids. Using tests and
extensive interviews, they discovered many social and personal factors
of controlling importance which can be applied in the more effective
design of hearing aids. Lorge (81), retesting by audiometer after two
years a group of twenty-five hearing aid users and a control group of
equal size, found no significant difference between aid users and
the control group in amount gained in hearing-speech capacity.


Physiological Measures 


Bitterman (11), using continuous electromyographic recordings of
cardiac and eyelid activity, obtained while subjects performed visual 
tasks in two different experiments involving reading material in six- and 
ten-point type at three and at ninety-one footcandles of illumination, 
found no evidence that either heart rate or blinking rate is an index of 
the ease of visual work. Loftus, Gold, and Diethelm (80), in an investi- 
gation of electrocardiographic changes accompanying intense emotional 
states, found evidence which pointed to the possibility of psychogenic 
origin of the electrocardiographic changes. There was no evidence of 
cardiovascular disease. Lennox, Gibbs, and Gibbs (79) observed identical 
electroencephalographic tracings in 85 percent of fifty-five pairs of 
monozygotic twins. Results of their study indicated that the brain wave 
pattern is hereditary and that the encephalogram can be used in human 
genetic studies.


CHAPTER VI


Tests and Measurements 


WILLIAM B. SCHRADER and HERBERT S. CONRAD 


Achievement testing, broadly conceived, constitutes the basic topic of
the present chapter. In preparing this summary, recognition has been given
to the fact that certain other issues of the Review cover measurement in
specific subjectmatter fields. Some overlap may occasionally be observed 
within the fields of intelligence, special aptitudes, or personality, in the 
case of studies having implications for achievement testing. 

Persistent attention to long-standing objectives characterized achieve- 
ment testing during this triennium. Basic conceptions in achievement 
testing were clarified and extended. Progress toward greater directness 
of measurement was substantial, tho gains in this direction were generally 
made at the expense of objectivity or factorial purity. The distinction 
between the measurement of detailed subjectmatter content and the 
evaluation of general educational outcomes was sharpened. Numerous 
large-scale testing programs were energetically conducted. 

Promising fields in which too little work was reported included: (a) 
utilization of research data offered by cumulative records; (b) long-term, 
follow-up studies of the retention of knowledge and skills during and after 
the school years; (c) effect of differing motivation and curriculum em- 
phases upon intercorrelations of achievement test scores; and (d) de- 
velopment of improved criteria for validating aptitude and achievement 
tests. In general, the greatest needs at present relate to test evaluation, 
test methodology, and the effects of testing upon broad educational ob- 
jectives. 


Textbooks and Reference Sources 


The late C. C. Ross (104) completed a revision of his textbook on ed- 
ucational measurement, taking account of recent contributions in the field, 
but placing relatively little emphasis on statistical developments. He 
also prepared a workbook (103) to accompany the revised text. Another 
workbook was prepared by Remmers and Gage (101) for their measurement
text. Adkins et al. (1) wrote a practical, generally useful book on
achievement test construction, with particular reference to civil service 
work. The chapter on performance testing should be especially valuable. 
A new book on guidance technics by Traxler (138) includes an annotated 
list of widely used achievement tests, along with other valuable material 
on the use of tests. Crawford and Burnham (25), in their book on 
forecasting success in college, presented a thoughtful analysis of achieve- 
ment testing at the college level. Wood and Haefner (154) wrote a 
book on guidance and testing in a highly popularized style.

In the field of bibliography, Hildreth (68) prepared a supplement to
her earlier bibliography on published tests, bringing it up to date as of 
1945. Swineford and Holzinger (121, 122, 123) issued their annual an- 
notated bibliography on statistics, theory of test construction, and factor 
analysis. Finally, mention must be made of the new edition of Buros’ 
important Mental Measurements Yearbook, publication of which is likely 
to fall just beyond the time period covered by the present summary. 


Test Construction 


The development of new tests within subjectmatter fields is most ap- 
propriately discussed in issues of the Review pertaining to these fields. 
Notices of newly published tests also appear in Psychological Abstracts, 
the Journal of Consulting Psychology, and Educational and Psychological 
Measurement (44, 45). The present summary is concerned primarily with
tests developed for research purposes, or those which present some feature
of general interest. 


New Tests 


Tests were developed according to logical analyses of subjectmatter 
fields by Harris (63, 64), Spache (114), and Ebert (43). Harris sought 
to measure seven logically discriminable aspects of the comprehension 
of literary materials. He designed an experimental test in which seven 
reading selections (including poetry and prose) were used, and in which 
items were systematically constructed to represent the various kinds of 
tasks. Factor analysis, of the bi-factor type, led him to conclude that 
a single common factor accounted for the intercorrelations among scores 
on the seven aspects of behavior. The seven reading selections used also 
were found to yield only one factor. Spache developed a test of arithmetic 
reasoning, broken down into five logical aspects, corresponding to five 
stages in the sequence of solving arithmetic problems. He was thus able 
to secure five part-scores. The intercorrelations of part-scores were 
apparently computed for widespread groups, and in consequence are 
difficult to interpret. Ebert developed a test of generalization abilities 
in mathematics suitable for eighth-grade students. Three types of perform- 
ance were required: (a) given several mathematical statements, to write 
another example of the common principle; (b) given several mathematical 
statements, to write a verbal statement of the common principle; and (c) 
given a verbal statement of the common principle, to write a mathematical 
example. Rather high intercorrelations were found among the various 
types of items. 

Other tests of considerable interest were developed in connection with 
specific educational research problems. Malter (80, 81) prepared a test 
of the ability of children to read diagrams involving simple symbols and 
another test for ability to read cross-sections. These tests were designed
to determine whether or not children in Grades IV to VIII could use such
aids in printed materials. Reiner (99) constructed a test of cause and effect
suited to ninth-grade science students. Students were required to state
whether a condition in a given sentence was (a) a direct cause, (b) an 
indirect cause, or (c) no cause of an event in a second sentence. Using a 
thirty-five minute, seventy-eight item test, he obtained a reliability of .91. 

Grener and Raths (59) prepared an interpretation-of-data test suitable 
for children in the third grade. Horrocks and Troyer (70) developed a 
case-study test similar to Baller’s “Case of Mickey Murphy.” The three 
cases used were carefully constructed to provide adequate coverage of 
main topics in mental hygiene and adolescent psychology. Items were 
designed to measure ability in diagnosis and remediation. Troyer (139) 
described the procedure used in building a more satisfactory master’s 
examination for students of education. The test included a section on the 
interpretation of professional data and a section which required the 
student to evaluate the appropriateness of an action or decision in relation 
to a specific situation. Troyer noted that there was some initial frustra- 
tion on the part of students when this type of examination was used. 
Nevertheless, the value of the test for diagnosis and its close integration 
with objectives were considered desirable features. Angell (6) described 
the preparation of a test in educational philosophy. Among other things, 
the test required students to evaluate a series of statements in terms of 
each of the three schools of thought, and to rank groups of educational 
issues in order of importance. Robbins (102) used a test in which students 
were required to evaluate adequacy of reasons for certain opinions. He 
found that college students performed slightly but significantly better 
in identifying correct reasons for views with which they agreed than for 
those with which they disagreed. Bottorf (15) developed a test of art 
appreciation stressing art in everyday experience. Students were required 
to express preference for various objects. Evidence for validity of the test 
as a whole was found in the relationship between mean scores and the 
amount of art training a student had received. Nahm (90) prepared a 
situation-response test on mental hygiene for use in a survey of student 
nurses. She found that about two-thirds had a relatively sound viewpoint, 
as judged from test performance, but about one-fourth showed a definite 
lack of understanding. 


Problems in Test Construction 


The yearbook of the National Society for the Study of Education on the 
measurement of understanding (91) made what appears to be an important 
contribution to testing. For each major field of school achievement, thoro 
consideration was given to an analysis of objectives and to the presenta- 
tion of illustrative items dealing with ways of testing or observing the 
desired pupil behavior. Brownell and Sims discussed eight essential char- 
acteristics of understanding and Findley and Scates formulated nine general 
principles to guide the measurement of understanding. This yearbook 
represents a very useful source of test ideas for measuring broader
objectives in subjectmatter fields. At the college level, the Executive
Committee of the Cooperative Study in General Education (4) provided helpful
material for the evaluation of some of the less tangible outcomes of 
college education. 

Engelhart (48) has summarized a series of suggestions to classroom 
teachers preparing multiple-choice tests for machine scoring. Much of 
the material is applicable to the improvement of any classroom test. 
Mosier, Myers, and Price (87) also presented numerous suggestions 
for improving multiple-choice items, including a list of criteria for check- 
ing such items. Diederich (32) described at some length the examination 
system in use in the University of Chicago’s comprehensive examination 
program. In his judgment, factors which contributed to the satisfactory 
operation of the system included: (a) emphasis on problem-solving 
rather than fact questions; (b) the convention that only objectives agreed 
to by the teaching staff are covered by the examinations; and (c) the 
fact that students need not take an examination until they are ready. 
Various other writers who have discussed problems of test construction 
in specific fields are as follows: Sueltz (119) in arithmetic; Bowers (16) 
in industrial arts; Beckley (10) in retail selling; Barnett (7) in the 
social studies; and Meredith and Burr (84) in the area of intergroup 
education. Hendricks (65) reported on an effort to design paper-and- 
pencil tests for measuring important objectives in chemistry laboratory 
instruction. Validation against performance test results was reserved for 
later study. 


Evaluation


Validity and Reliability of Tests


For discussion of aspects of reliability and validity which are primarily 
statistical or computational, reference should be made to the article by 
Travers (132) in the February 1946 Review, and to chapter VII by 
Johnson and chapter VIII by Fattu in the present issue. The discussion 
here is limited to broad problems of validity and reliability, to studies of 
the validity of specific tests, and to evaluations of test technics in terms of 
reliability and validity. 

Validity—In a comprehensive discussion of different kinds of “face 
validity,” Mosier (86) pointed out that “validity by definition” (the test 
duplicates the criterion) is entirely sound in principle; but that such 
validity sometimes comes dangerously close, in practice, to “validity by 
assumption.” Rulon (105) urged application of rigid standards to insure 
duplication between form and content of the test and the desired behavior. 
Tests so constructed could then be used to validate simpler, more practical 
instruments. 

Dyer (41, 42) studied the validity of College Board placement tests in 
German and French, using grades in courses at different levels as the 
criterion. Validity coefficients ranged from .43 to .87 for the German 
courses, and were in the .80’s in the French courses. In a study of the
relationship of scores on a recall-type objective test in German to ratings
of translations, assigned by highly experienced instructors, Dyer (40) 
obtained a validity coefficient of .64 based on 21 students. White (152), in 
a study of the validity of a test of achievement in high-school English, 
compared item analysis results on separate parts of the test with known 
differences in curriculum emphasis between two schools, and found that the 
differences were in accordance with expectation. 

Reliability—Two attempts at a general formulation of the interrelation- 
ships of various methods of determining reliability appeared. Cronbach 
(27) evaluated six procedures in terms of the kinds of variances included 
in the estimate of error-variance used in determining reliability. He con- 
cluded that since no method is generally best, test authors should report 
more than one kind of reliability coefficient. He noted that the estimation 
of reliability generally involves making assumptions which the experi- 
mental conditions probably do not warrant. 
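
Cronbach's recommendation that more than one coefficient be reported can be illustrated with a small sketch. The example below is not drawn from his paper. It assumes right-wrong item scores for a hypothetical group of six examinees and computes two familiar estimates from the same data: an odd-even split-half coefficient stepped up by the Spearman-Brown formula, and Kuder-Richardson formula 20. All names and data are illustrative.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def split_half(items):
    """Odd-even split-half reliability, stepped up by Spearman-Brown."""
    odd = [sum(row[0::2]) for row in items]   # items 1, 3, ...
    even = [sum(row[1::2]) for row in items]  # items 2, 4, ...
    r = pearson(odd, even)
    return 2 * r / (1 + r)


def kr20(items):
    """Kuder-Richardson formula 20, using the population variance of total scores."""
    n, k = len(items), len(items[0])
    totals = [sum(row) for row in items]
    mean = sum(totals) / n
    var_total = sum((t - mean) ** 2 for t in totals) / n
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in items) / n  # proportion passing item j
        sum_pq += p * (1 - p)
    return k / (k - 1) * (1 - sum_pq / var_total)


# Hypothetical right-wrong scores: six examinees, four items.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
r_split = split_half(scores)
r_kr20 = kr20(scores)
print(round(r_split, 2), round(r_kr20, 2))
```

For these hypothetical data the two procedures give about .90 and .83 respectively, which suggests why a reliability coefficient is interpretable only when the procedure used is named.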

R. L. Thorndike (126) made a detailed analysis of sources of variance 
in performance and evaluated ten possible procedures in determining 
reliability. His report drew upon the extensive experiences of the Army 
Air Forces Psychology Program. He stressed the importance, in the case 
of speeded tests, of obtaining test-retest reliability (by the separate admin- 
istration of equivalent forms). He also pointed out that the determination 
of test reliability is sometimes inherently difficult or unsatisfactory; viz., 
for tests involving insight, tests in which the subject has immediate knowl- 
edge of results, and performance tests in which sets of trials are run under 
conditions which vary appreciably (and generally uncontrollably) from 
set to set. 


Reliability and Validity of Technics 


Plumlee (96) has compared the relative value of three types of mathe- 
matics items at the college-entrance level; viz., “demonstrative,” “multiple- 
choice,” and “answer-only.” She concluded that the answer-only form was 
superior to the demonstrative form both in reliability (.92 vs. .75) and 
in predicting course grades (.44 vs. .37) for comparable testing time. 
She found that the multiple-choice and answer-only forms were about 
equally reliable and equally effective in predicting college marks. The 
multiple-choice form was, of course, more economical with respect to 
scoring. Huddleston (71), also working at the college-entrance level, found 
that an objective test of English grammar correlated more highly with
teachers’ ratings on writing ability and with grades earned in English
during the two years prior to the testing than did an essay test. Each test 
was of twenty minutes’ duration. The Verbal section of the Scholastic Apti- 
tude Test developed by the College Board correlated substantially better 
with both criteria than did the English tests. Good results from an objective 
test in predicting English grades were also obtained by Berg, Johnson, 
and Larsen (12).

Problems related to specific technics were also attacked in several other
studies. Wesman and Bennett (151) did not find any advantage in the 
use of “none of these” as an alternative in multiple-choice items, altho 
they found some indications that this alternative tended to slow down 
response. Using item analysis procedures, Wesman (150) found that it 
was possible to identify correctly spelled words which have value as regu- 
lar items (rather than merely as “filler”) in a true-false spelling test. 
Wesman (148) also studied the relative merits of instructions to mark 
every item “right” or “wrong,” as compared with marking only the wrong 
items, on a test of grammatical structure. No appreciable effect on reliabil- 
ity was found. 


Interpretation of Test Scores 


Flanagan (51) predicted that test norms would tend to place increasing 
emphasis upon socially meaningful statements about performance, along 
with the numerically expressed relative standings now emphasized. The 
development of such absolute standards should aid in the interpretation 
of test results by counselors and administrators. Such an approach implies 
empirical validation and the determination of critical scores in terms of 
extra-school standards.

Several articles dealt with technical points in the development of test 
norms. Thurstone (127) described briefly a procedure which he has long 
used in making scores on a new edition of a test comparable with scores 
on an older edition of the test, when the original standardization sample 
is no longer available. Harris (62) called attention to a common error 
in the interpretation of grade norms; namely the failure to keep in mind 
that zero attainment goes with a grade score of 1.0. Stein (115) described 
a test of homogeneity of variances, which should be used before combining 
results from different schools when determining local norms. Tucker (140) 
described a new system of norms for College Board foreign-language 
achievement tests. These norms were designed to achieve comparability 
from one foreign language to another, by taking into account the fact that 
the average amount of students’ previous training varied from one language 
to another. Turnbull (141) found that it was not necessary to take into 
account the sequence of topics in physics courses in interpreting the 
performance of students on a College Board physics achievement test taken
two months before the end of the course. 

Closely related to norms is the problem of score profiles. Walker (147) 
called attention to the need for a means of measuring the difference between 
two profiles in such a way that analysis of variance could be used to test 
agreement. Bennett (11) in a brief presentation argued that tests, before 
being used as separate entries in a profile, should meet a criterion de- 
veloped by Kelley. Traxler (137) concluded on the basis of intercorrela- 
tions of part-scores that vocabulary, grammar, and reading scores on 
Cooperative Tests of foreign-language achievement are sufficiently different
to justify separate consideration of each. In an experimental study covering
two years, Tilton (128) attempted to change the relative position
of arithmetic on the achievement test profile. He found that one of six groups 
studied showed a substantial relative gain; one showed a substantial 
relative loss. 

Relevant to the problem of interpretation of student gains are studies 
by Woodrow (155) and Wesman (149). Woodrow investigated the inter- 
correlations of gains on Metropolitan and Stanford Achievement Tests for 
four elementary-school groups. For three of the groups, gains over a one- 
year interval were studied; for the other, gains over one-, two-, and three- 
year intervals. Altho three factors were tentatively identified in the gains, 
the intercorrelations were characteristically low, indicating that specific 
variance and error variance were conspicuously present in the gain scores. 
He also found that, except for one fourth-grade group, the correlations 
of gains with IQ were characteristically low. Wesman (149), using tenth-
and eleventh-grade students, correlated gains in intelligence test scores
with gains on achievement tests. The findings are in accord with the view 
that gain scores, particularly if the lapse of time has been relatively short, 
tend to yield low correlation coefficients. 
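
The low correlations of gain scores noted above are consistent with a standard result of classical test theory, which is not stated in the studies reviewed: when two tests correlate highly with each other, the difference between them is much less reliable than either test. The sketch below applies the usual formula for the reliability of a difference score, assuming uncorrelated errors of measurement. The function name and the numerical values are hypothetical.

```python
def difference_reliability(r_xx, r_yy, r_xy, sd_x=1.0, sd_y=1.0):
    """Reliability of the gain D = Y - X under classical test theory.

    r_xx, r_yy: reliabilities of the initial and final tests.
    r_xy: correlation between the two tests.
    Assumes errors of measurement are uncorrelated.
    """
    cov = r_xy * sd_x * sd_y
    true_var = sd_x ** 2 * r_xx + sd_y ** 2 * r_yy - 2 * cov
    total_var = sd_x ** 2 + sd_y ** 2 - 2 * cov
    return true_var / total_var


# Hypothetical case: two tests, each with reliability .90, correlating .80.
gain_rel = difference_reliability(0.90, 0.90, 0.80)
print(round(gain_rel, 2))
```

In this hypothetical case the gain has a reliability of only about .50, so that correlations of such gains with other measures would be expected to run low.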

A number of reports dealt with factors tending to obscure the interpreta- 
tion of test performance. Of these, most significant was the review by 
Cronbach (26), in which he brought together numerous concepts under 
the general heading of “response sets” (tendency to gamble, speed vs. 
accuracy, and others). Within the field of achievement testing, response 
sets would presumably be most important in the less structured test situa- 
tions (including essay tests), in true-false tests, and in tests where the 
same set of choices may apply to a whole series of items (e.g., true, prob- 
ably true, uncertain, etc.). Cronbach offered a number of suggestions for 
minimizing the influence of response sets. Muntyan (89), in a study of 
retest performance, found that high-school seniors who repeated the same 
tests after a year did reliably better than a presumably comparable group 
of seniors taking the tests for the first time. When alternate forms of the 
test were used in the retest, only the physical science scores were reliably 
higher for the repeaters. Separate norms were recommended as a means 
of overcoming the difficulty. Studies of cheating were reported by Krueger 
(74) and Gross (60). 

Relationships between achievement test scores and various other variables 
were included in several studies. Heston (67) studied the intercorrelations 
between scores on the Graduate Record Examination Tests of General 
Education and Cooperative Achievement Test scores. A factor analysis 
of the eight scores on the Graduate Record Examination Tests revealed 
a factor tentatively identified as “reading comprehension” with substantial 
loadings on all eight tests after rotation, and two other factors with rela- 
tively small loadings. Morgan (85) found that students with higher aca- 
demic achievement within their grade were also on the average significantly 
higher in social acceptance and reputation among children in Grades V
to VIII in a war-boom community. Gough (58) found only a slight positive
correlation between socio-economic status and achievement, after IQ had 
been partialled out, among 127 sixth-grade pupils. Lobaugh (78) found 
that girls averaged somewhat higher on marks, while boys averaged some- 
what higher on achievement tests; insufficient data were presented to 
evaluate the reliability or relative importance of the differences involved. 
Schreiber (109), using fifty-three cases, found no reliable gain in
arithmetic computation during four years in high school; by contrast, spelling
and language usage, literature, and hygiene showed marked gains during
this period. 
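
The procedure of partialling out IQ, as in the study by Gough, corresponds to a first-order partial correlation. The sketch below shows the standard formula. The three correlations used are hypothetical values chosen for illustration and are not Gough's data.

```python
from math import sqrt


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y with z held constant (first-order partial)."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical values: x = socio-economic status, y = achievement, z = IQ.
r_partial = partial_correlation(r_xy=0.40, r_xz=0.50, r_yz=0.60)
print(round(r_partial, 3))
```

With these assumed values a zero-order correlation of .40 shrinks to about .14 once IQ is held constant, the kind of reduction that would leave only a slight positive relationship.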


Evaluation of Technics for Testing 


Vallance (143) undertook a comparison of essay-type and objective-type 
examinations with respect to their value as learning experiences. Contrary 
to earlier results in this area, he found that there was no evidence for the 
superiority of the essay test either as a learning experience or as a means 
of encouraging students to use a superior learning method in preparing 
for the test. Diederich (33), in a thoro discussion of the measurement of 
writing skill, pointed out that the correlation between two three-hour essay 
examinations written on different topics on the same day did not rise 
above .55. He argued that higher “reliability” in essay examination grades, 
if obtained by artificial devices, may lead to loss in validity. Diederich sug- 
gested that the efficiency of the essay examination for measurement of skill 
in writing may be increased by announcing, about a week before the ex- 
amination, the reading passages upon which the composition will be based. 
Freeman (54) attacked the monopoly now held by objective tests, and indicated 
that tests should give more attention to requiring students to express 
thoughts in their own language. Sims (112) argued for the essay test from 
another viewpoint, stressing that the essay test gives valuable insights into 
personal-social development and processes of thought and judgment. 

In an exploratory study, Courtney, Bucknam, and Durrell (24) com- 
pared written-recall, multiple-choice, and oral-recall coverage of the same 
material. This problem is worthy of further study, using large samples 
and varied subjectmatter. Another study which has a definite bearing 
upon test technic was reported by Davidson and Carroll (29). They found, 
by use of factor analysis and multiple correlation technics, that scores 
on time-limit tests of aptitudes frequently represented a mixture of knowl- 
edge or skill and speed of performance. Similar investigations are needed 
in the case of achievement tests. 

A number of papers reported useful technics for achievement testing. 
Tinkelman (129) found that it was possible for judges to rate the relative 
difficulty of achievement test items in a particular field with satisfactory 
accuracy, but judgments of absolute difficulty were less useful. His findings 
imply that items can be arranged in an ascending order of difficulty with a 
reasonable degree of accuracy on tests which have not been pretested. 
Herfindahl (66) described two procedures for reading standard scores 
directly, using an International Test Scoring Machine. Taylor (125) outlined 
a number of shortcuts and technics for efficient use of the test scoring 
machine, and Bice (13) described a procedure for providing detailed 
knowledge of results to students on machine-scored answer sheets. Traxler 
(135) advocated overprinting of answer sheets to facilitate hand-scoring 
and to give students knowledge of their errors. In order to make copying 
difficult, Lemmon and DuBois (77) developed a procedure whereby stu- 
dents are required to answer questions printed in the same order in the 
test booklet on different spaces on their answer sheets. Ryans (107) 
described in detail a procedure for constructing a profile chart, using 
standard score conversions based on local data. Mosier and Price (88) 
presented a convenient system for arranging the alternatives of five-choice 
multiple-choice items in random order. 
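
Random ordering of the alternatives in a five-choice item can be sketched as follows. This is not the system of Mosier and Price, whose details are not given here; it is a present-day stand-in using a seeded pseudo-random shuffle, and the function name and sample alternatives are hypothetical.

```python
import random


def shuffle_alternatives(alternatives, key_index, rng):
    """Return the alternatives in random order and the new key position.

    alternatives: list of answer choices in their original order
    key_index:    position of the keyed (correct) answer in that list
    rng:          a random.Random instance, seeded for reproducibility
    """
    order = list(range(len(alternatives)))
    rng.shuffle(order)                      # order[j] = original index now at position j
    shuffled = [alternatives[i] for i in order]
    new_key = order.index(key_index)        # where the keyed answer landed
    return shuffled, new_key


# Hypothetical five-choice item whose keyed answer is "C".
rng = random.Random(0)
choices = ["A", "B", "C", "D", "E"]
shuffled, new_key = shuffle_alternatives(choices, 2, rng)
```

Tracking the new key position alongside the shuffled alternatives keeps the scoring key in step with the printed order.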


Applications 
Planning the Testing Program 


Findley (49) outlined a group testing program suitable for a modern 
school. Among other points related to achievement testing, he recom- 
mended: (a) that the tests be given in the fall to guide curriculum plan- 
ning; (b) that careful cumulative records of results be kept; and (c) that in 
the first four grades measurement be based mainly on informal tests. Durost 
(38) outlined a minimum testing program, urging that achievement tests 
be used at the completion of each major phase of schooling, and that 
supplementary testing beyond the main program be provided for 10 to 15 
percent of pupils. The Spring 1946 issue of Educational and Psychological 
Measurement carried descriptions by various authors of testing programs 
in a number of schools and colleges. 


Counseling 


Bordin and Bixler (14), on the basis of experience with the use of a 
procedure in which the counselor described the tests and allowed the client 
to express feelings in connection with the choice of tests, concluded that 
the technic had promise for increasing the guidance values of tests. They 
recommended that systematic research be undertaken to provide a more 
objective and comprehensive evaluation of possible effects. Traxler (134) 
expressed the view that achievement tests covering broad areas rather than 
tests in specific subjectmatter fields should be used for most counseling 
purposes. Diederich (34) described a program of remedial training in 
English built around the use of comprehensive examinations normally re- 
quiring three hours to complete. He reported that good results had been 
obtained by using recognition tests of writing ability in which students 
were required to choose among alternative ways of completing passages 
in actual student themes. 


Teacher Efficiency 


A number of pioneering investigations have used pupils’ achievement- 
test gains as an index of teacher efficiency (8, 9, 21, 116). In the best 
studies, such variables as teacher-load, class size, class homogeneity, and 
socio-economic status of the groups are kept constant or are allowed for. 
Gain is measured as the difference between actual and expected gain. The 
expected gain for a pupil is the average gain for individuals who resemble 
that pupil in weighted score on a predictive composite; the predictive com- 
posite is made up of such elements as mental age, IQ, and scores on 
achievement tests at the beginning of the term. The statistical analyses 
utilize class averages rather than individual scores wherever appropriate 
to save computational labor. Finally, the design recognizes the possibility 
that a teacher may be more efficient with some kinds of pupils than with 
others (e.g., upper vs. lower halves of a class). 
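
The notion of gain measured against expected gain, as described above, can be sketched in a few lines. The sketch assumes that pupils who "resemble" one another are those falling in the same band of the predictive composite; the band width, field names, and data are all hypothetical, and no adjustment for class size or other variables is attempted.

```python
from collections import defaultdict
from statistics import mean


def residual_gain_by_teacher(pupils, band_width=10):
    """Average (actual gain - expected gain) for each teacher.

    pupils: list of dicts with keys 'teacher', 'composite', 'gain'.
    Expected gain for a pupil is taken as the mean gain of all pupils
    in the same band of the predictive composite (an assumption).
    """
    gains_in_band = defaultdict(list)
    for p in pupils:
        gains_in_band[int(p["composite"] // band_width)].append(p["gain"])
    expected = {band: mean(g) for band, g in gains_in_band.items()}

    residuals = defaultdict(list)
    for p in pupils:
        band = int(p["composite"] // band_width)
        residuals[p["teacher"]].append(p["gain"] - expected[band])
    return {teacher: mean(r) for teacher, r in residuals.items()}


# Hypothetical pupils: two teachers, two composite bands.
pupils = [
    {"teacher": "A", "composite": 95, "gain": 12},
    {"teacher": "A", "composite": 105, "gain": 8},
    {"teacher": "B", "composite": 97, "gain": 8},
    {"teacher": "B", "composite": 103, "gain": 4},
]
index = residual_gain_by_teacher(pupils)
```

In this made-up case teacher A's pupils gain two points more than comparable pupils on the average and teacher B's two points less, although the raw gains alone would also reflect differences in the pupils' initial standing.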

Results from the use of achievement tests in the measurement of teacher 
efficiency have so far been rather inconclusive, partly because the number 
of teachers studied has been relatively small. Sometimes the number of 
teachers in the sample has been little larger than the number of variables 
employed in the predictive composite. Another weakness is the lack of 
cross-validation (i.e., the use of a fresh sample, as a check against the 
capitalization of chance in the determination of regression weights and 
the multiple correlation coefficient). 
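
Cross-validation in the sense just defined can be sketched as follows: regression weights are determined on one sample and then applied, unchanged, to a fresh sample. The two-predictor least-squares fit, the function names, and the data are illustrative assumptions; the data are constructed to fit exactly, so they show only the mechanics and not the shrinkage that real data would ordinarily exhibit.

```python
from math import sqrt
from statistics import mean


def fit_two_predictors(x1, x2, y):
    """Least-squares intercept and weights for y on x1 and x2."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))
    det = s11 * s22 - s12 * s12           # zero if predictors are collinear
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2


def pearson_r(u, v):
    """Product-moment correlation of two equal-length sequences."""
    mu, mv = mean(u), mean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)


def cross_validated_r(weights, x1, x2, y):
    """Correlation between predictions from fixed weights and a fresh criterion."""
    a, b1, b2 = weights
    predicted = [a + b1 * p + b2 * q for p, q in zip(x1, x2)]
    return pearson_r(predicted, y)


# Hypothetical derivation sample (weights are fitted here) ...
weights = fit_two_predictors([1, 2, 3, 4], [1, 0, 1, 0], [6, 5, 10, 9])
# ... and a fresh sample on which the same weights are checked.
r_fresh = cross_validated_r(weights, [5, 6, 7], [0, 1, 1], [11, 16, 18])
```

The essential design point is that `cross_validated_r` never refits the weights; any capitalization on chance in the derivation sample therefore shows up as a lower correlation in the fresh sample.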

It scarcely needs more than mention that achievement tests do not meas- 
ure all the desirable products of teaching. Consequently, the application of 
achievement tests in the evaluation of teacher efficiency must be carried out 
with great caution and discrimination. 


Research 


Shannon (110), studying the role of various technics in educational 
research, found among other things that achievement tests were the most 
frequently occurring source of data in 1377 research studies published in 
the Journal of Educational Research. Among studies in which use of 
achievement tests in research produced results of general significance, the 
following may be briefly noted: Davenport and Remmers (28) found that 
average score on the Navy V-12 tests (which included aptitude as well as 
achievement material) for candidates from the various states correlated 
.63 with average teacher’s salary and .80 with current average per pupil 
cost in the states. Mandell and Adkins (82) found that an interpretation- 
of-data test, of the type used in the Progressive Education Association 
Eight-Year Study, correlated well with standing of administrators in the 
federal government. 

The desirability of using achievement tests as criteria for evaluating 
aptitude measures received some attention. Van Dusen (144) pointed out 
that the quality of realism obtained by measuring actual performance 
could not compensate for lack of reliability or fairness to all persons being 
measured. Stuit (117) gave two examples of situations in which the substitution 
of achievement test scores for informal ratings led to important 
shifts in validity coefficients of predictors. 


Credit by Examination 


Interest continued in the possibility of increasing educational efficiency 
thru the use of achievement tests for granting college credit. Pressey (97) 
pointed out that the current strain upon educational facilities may be eased 
by giving credit by examination. A. B. Garrett (56) indicated that sixteen 
years of experience with credit by examination in freshman chemistry at 
Ohio State University has led to satisfaction with the results. These tests 
are given to all students who have high-school credit in chemistry; from 
5 to 15 percent are given credit for one-quarter’s work in college chemistry. 
A booklet describing the achievement tests is sent to each student to aid him 
in reviewing. Programs in other colleges were described by Goetsch (57) 
and by Wickhem (153). 

Tests of the United States Armed Forces Institute have received consider- 
able attention. Detchen (30) provided a summary of information on the 
USAFI end-of-course tests, subject examinations, and General Educational 
Development Tests, including a list of thirty-seven references. Townsend 
(131) found that the USAFI American History Test correlated to the 
extent of .85 with the Cooperative American History Test (based on 
seventy-five cases); but that the intercorrelations of the parts were too 
high relative to the reliabilities to justify much use of the part scores. 

Studies by Dyer (39), Bradley (17), and Callis and Wrenn (19) 
support the view that the USAFI General Educational Development Tests 
are excellent as predictors of college achievement. Dyer, Bradley, and 
Frandsen found low correlations between part scores and amount of 
college training in corresponding subjectmatter areas, implying that these 
tests fall more nearly in the field of aptitude than of achievement. Frandsen 
(53) found that the General Educational Development natural science test 
scores correlated with computational and scientific interest as measured 
by the Kuder Preference Record, suggesting that the part scores may reflect 
interest as well as aptitude. 

Dressel (35) presented a summary of how the General Educational 
Development Tests are used in the awarding of high-school diplomas. He 
noted that aside from mathematics, college courses no longer require fixed 
prerequisites in terms of high-school subjects. With respect to granting 
college credit on the basis of General Educational Development Tests, he 
reported a survey conducted by Barrows in which it was found that only 
about 10 percent of colleges are giving college credit to any extent on the 
basis of General Educational Development performance. Love (79) de- 
scribed the use of the General Educational Development Test in the College 
of Education at Ohio State University. Emphasis was placed on individual- 
izing the use of the tests, with consideration being given to aptitude score 
before permitting the student to take the examination for credit. 


Armed Services 


Chambers (20), studying possible gains for education from armed serv- 
ices training, found that one outcome on which the majority of 258 edu- 
cators with armed services experiences agreed was the desirability of more 
frequent use of achievement tests in civilian education. Tyler (142) con- 
sidered the emphasis on continuous evaluation as one of six factors con- 
tributing to the effectiveness of military and naval training programs. 

Several books describing armed services programs have devoted con- 
siderable attention to achievement testing activities. A general description 
of the achievement testing program of the Army Air Forces was presented 
by Flanagan (50). Hobbs (69) reported on a program of measurement 
using both printed and performance tests of flexible gunnery; in various 
studies, men were tested at all stages from basic school thru combat. 
Descriptions of the use of printed and performance tests in the Navy were 
presented in a volume edited by Stuit (118). The success of the armed 
services with performance testing suggests that this is a promising field for 
development. It should be noted, however, that the expense of this type of 
testing may well present a problem in civilian schools. 


Large-Scale Testing Programs 


A significant development in national testing programs was the pre- 
liminary report of the Committee on Testing appointed by the President of 
the Carnegie Foundation for the Advancement of Teaching (22). This 
committee expressed its belief that the unification of the large nonprofit 
testing agencies would benefit American education thru the elimination 
of overlapping effort and thru the availability of greater resources. In 
December 1947, the Educational Testing Service was incorporated, to 
bring together the testing activities of the American Council on Education, 
the College Entrance Examination Board, and the Carnegie Corporation 
and Foundation (46). The new organization is thus responsible, usually in 
cooperation with a sponsoring agency, for a number of large-scale national 
testing programs (47). Among these, achievement tests play a prominent 
role in: Entrance Examinations of the College Entrance Examination 
Board; Nation-wide High-School Testing Program; National Freshman 
Placement Testing Program; National College Sophomore Testing Pro- 
gram; College Chemistry Testing Program; Cooperative Achievement 
Tests; English Examination for Foreign Students; Engineering Achieve- 
ment Tests; Graduate Record Examination; National Teacher Examina- 
tion; and the Preliminary Actuarial Examinations. 

Other testing programs of national scope which place substantial stress 
on achievement testing are conducted by the Educational Records Bureau; 
the Project for the Selection of Personnel for Public Accounting (136), 
associated with the Educational Records Bureau; the Pharmaceutical Sur- 
vey (100); and the testing program sponsored by the National League of 
Nursing Education (73). At the elementary level, the program for establishing 
norms on the Metropolitan Achievement Tests, in which pupils from 
every state were tested, may well be considered a national program (36). 

Anderson (5) has described the Nation-Wide High-School Testing Pro- 
gram in which, during 1946, the Cooperative Test of Recent Social and 
Scientific Developments was given to about 143,000 high-school students 
in forty-three states. Vaughn (146) discussed the various projects of the 
Graduate Record Office, including the Inquiry into Postwar Conditions 
in American Colleges. In this program, forty liberal arts colleges adminis- 
tered to every graduating senior the Graduate Record Examination Tests 
of General Education and an appropriate advanced test in the field of spe- 
cialization. Vaughn (145) also summarized the development of an achieve- 
ment testing battery for engineering sophomores. Ryans (107) gave an 
account of the testing activities of the Committee on Teacher Examinations, 
and presented descriptive statistics on the 1947 examinations. 

Among statewide programs, Smith (113) emphasized the role which 
testing had played in the activities of the Bureau of Cooperative Research 
and Field Service at Indiana University. Fox (52) described the rather 
comprehensive test services offered by Indiana University to schools in the 
state; detailed information is given on uses of tests, costs, available tests, 
and norms. 


Current Needs in Measurement 


Needs in measurement have been much discussed during this postwar 
period. A recurrent theme in these discussions is the need for testing to 
push into new fields, to develop new technics, and to exploit existing tech- 
nics more thoroly. Adkins (2), Deutsch (31), F. S. Freeman (55), and 
Wrightstone (156) have suggested specific areas needing active explora- 
tion, among the most important of which are critical thinking, originality, 
and work-study skills. Brownell (18) urged that the appraisal of learning 
in research studies should stress: (a) the process of performance; (b) the 
retention of knowledge and skills, especially as measured by the relearning 
method; and (c) the transferability of learning. In the field of intergroup 
education, where the need for broadened objectives is perhaps most acute, 
Raths (98) criticized workers who sacrifice relevance for objectivity, and 
Taba (124) advocated that very extensive observational study be carried 
out before undertaking objective test construction. 

A need for comparable tests and better norms to permit effective longi- 
tudinal studies and more satisfactory individual profiles has been stressed 
by Durost (37), Wrightstone (156), and Traxler (133). Various writers 
have expressed a need for a central agency to bring together and expand 
the stock of available information about tests, and to provide completely 
impartial advisory service to test users. The need for high standards of 
scholarship in constructing subjectmatter achievement tests has been indi- 
cated by Tolley (130) ; Palmer (94) and Packard (92) have questioned the 
scholarship of particular tests. Swenson (120) and Simpson (111) criti- 
cized the tendency toward hasty interpretation of standardized test results. 
A related problem is one of time and budget. Lass and Wrightstone (75) 
in a survey of measurement in New York City secondary schools, and 
Painter and Painter (93) in a survey of forty-four college orientation 
testing programs, found that lack of adequate time, budget, and clerical 
help were serious hindrances. Conrad (23), while recognizing that “the 
individual is ever the source of original ideas,” emphasized the need for 
coordinated institutional resources in the execution of research. 


Trends 





A number of writers stressed the importance of measuring general out- 
comes in terms of knowledge and skills, rather than in terms of time 
served or detailed course content. Thus, Learned (76) urged that large 
goals of study measured by suitable examinations replace the present 
course-credit system; Peik (95) predicted that the general education move- 
ment would facilitate such a change. Kaulfers (72) recommended that 
foreign-language testing be oriented around general goals of achievement 
of knowledge and skill. The USAFI General Educational Development Tests 
and the Tests of General Education developed by the Graduate Record 
Office may be considered efforts in this direction. Rulon (105) counseled 
moderation in applying the “generalized outcomes” approach, urging that 
the content area covered by a test should at least be recognizable. It would 
appear that the validation of such tests should include a check on the rela- 
tive contribution of training and aptitude to score variance. 

Better understanding of the organization of human aptitudes and abilities 
should lead to sounder and more efficient testing, as pointed out by Tolley 
(130). Wrightstone (156) noted a trend toward application of factor 
analysis to achievement testing problems. Guilford (61), in an article 
stressing aptitude testing but relevant also to achievement testing, pre- 
dicted that test authors may presently be expected to report data on the 
factorial composition of new tests. 

In his review of the testing movement from 1897 to 1947, Scates (108) 
noted that the full significance of the evaluation movement has not yet 
been realized. He pointed out that while measurement implies “moreness,” 
evaluation implies “appropriateness,” and that while measurement seeks 
simplicity and homogeneity, evaluation recognizes complexity and pays 
attention to the patterning of data. Megroth and Washburne (83) advo- 
cated, as a means of facilitating the measurement of intangibles, the de- 
velopment of a school atmosphere in which the pupil will not be afraid 
of being judged. Alpern (3) found that New York City academic high 
schools were using informal records of pupil behavior, along with the more 
formal cumulative record; several schools were collecting follow-up information 
about their graduates. 


CHAPTER VII 


Recent Developments in Statistical Theory 


PALMER O. JOHNSON 


Much of statistical theory and methodology basic to the science of statistics 
is of recent date, appearing during the past twenty-five or thirty years. 
Close attention to and appreciation and understanding of the mathematical 
and logical foundations is necessary for the development of statistics. 
The basic research under development by the mathematical statistician 
provides the scientific capital from which fund the practical applications 
of statistical principles are drawn. There is wide variation among fields 
and among workers in any given field with respect to the appropriateness 
and efficiency of the statistical methods of analysis and designs of in- 
vestigations used. It is hoped that this review may serve to make every 
research worker more critical of his practices. 


Books 


Two books making outstanding contributions to statistical theory and 
practice were published in 1946: Cramer’s Mathematical Methods of 
Statistics (31) and Kendall’s second volume of The Advanced Theory 
of Statistics (69). The former devotes twelve chapters to mathematical 
introduction comprised mainly of the theory of measure and integration 
supplemented by other mathematical theorems and tools to make the book 
mathematically self-contained for a reader with a good working knowledge 
of differential and integral calculus. The second part of the book con- 
tains the general theory of random variables and probability distributions. 
The main part of the book, entitled Statistical Inference, is devoted to 
the theory of sampling distributions, statistical estimation, and tests of 
significance. The exposition thruout is mathematical in nature but is 
illustrated by numerous examples from several fields of application. 
The main emphasis of the theoretical presentation is on the determina- 
tion of the precise conditions for the validity of the theorems, their 
connections with general probability and the logical relations among 
themselves. Kendall’s Volume II supplements his earlier Volume I to 
form together a most comprehensive exposition of the theory of statis- 
tics. Four chapters on the theory of estimation, including one on the 
derivation of the properties of the maximum likelihood estimate, and a 
chapter each on Fisher’s theory of fiducial probability and Neyman’s 
theory of confidence intervals comprise the first section of the second 
volume. The main section of six chapters is devoted to the theory of 
statistical tests and treats tests of significance, the analysis of variance, 
the general theory of testing statistical hypotheses originated by Neyman 
and Pearson, and recently developed technics of multivariate analysis. 
The remaining four chapters are on regression, the design of sampling 
inquiries, and time series. What the author terms the logical aspects of 
statistical inference are treated very forcefully in this volume. That is, 
he presents the broad principles on which inferences are drawn from 
statistical data in terms of probability. While this book does not attain 
the mathematical rigor of Cramer’s treatment, it contributes both to those 
readers who are primarily interested in the advancement of research 
and to the practical statistician who, from its reading, will gain insight 
into the thinking and acting of the investigator when faced with the prob- 
lems of planning and interpreting scientific inquiries. The somewhat 
detailed description of these two books serves to enumerate for the reader 
the principal problems of statistical science today. 

Among other books published during the period, those written by 
Aitken (1), Hoel (56), Hogben (58), Kelley (68), Thompson (97), 
Thurstone (99), and the Statistical Research Group (27) may be pointed 
out for their general or special contribution to statistical theory and 
practice. A useful index of mathematical tables is now available (45). 


Foundations of Statistical Theory and Method 


The development of statistical theory and methodology as an exact 
science is founded in mathematical probability. The conceptual model 
constructed to deal with statistical data is grounded on probability 
theory. The axioms and the structure of theorems based on them make 
up the subject of mathematical theory. The various ways of choosing 
axioms have led to different formulations of the theory of probability. 

The chief function of statistics in scientific research is in the role of 
drawing statistical inferences, the process by which new scientific 
knowledge comes into being. The two problems of statistical inference 
are: (a) the problem of testing statistical hypotheses and (b) the prob- 
lem of estimation. 

The theory of the design of experiments has as its purpose the develop- 
ment of principles applicable to the collection of primary data in such 
a way that valid inferences can be drawn from them and for eliciting 
the greatest amount of information latent in the data in the most efficient 
way. 

Quite the most fundamental problem in practical statistics is the sam- 
pling problem. The developments occurring in the design and analysis of 
experiments have been of special significance in sampling theory and 
practice. 

This delineation of the subjectmatter of statistical science in terms of 
its major purposes and procedures serves as the basis of classification 
of the studies reviewed. Some of the studies contribute to several of the 
areas, but each is placed in that category to which, in the opinion of the 
reviewer, it contributes most directly. 


Probability Theory 


The rapid development of statistics went forward with striking increases 
in the number and the significance of its diverse applications in many 
fields. The rapid exploration of new domains left numerous gaps in the 
theoretical foundations. However, at present, as for some time past, 
increasing numbers of investigations are contributing to a 
more and more rigorous theoretical structure. The firm grounding of 
statistics in probability theory and the significant contributions in this 
field are clearly demonstrated in the very comprehensive survey of 
problems restricted to the purely mathematical aspects of the subject 
presented by Cramer (32). Modern statistical inference stems from the 
classical limit theorem of probability. Two fundamental papers dealt 
with this topic. Feller (39) explained the mathematical content and the 
meaning of the two most significant limit theorems in the modern theory 
of probability: (a) the central limit theorem and (b) the law of the 
iterated logarithm. Erdos and Kac (38) proved four important limit 
theorems of the theory of probability. 

Hsu (60) studied the approximate distribution of ratios of two types. 
A knowledge of the probability distributions 
of ratios has special significance in education and psychology where 
quotients are frequently used as statistics, for example, the intelligence 
quotient. It is also important in experimental work to be able to set up 
the true average of the experimental group as a fraction of the true 
average of the control group. In this case, what is needed are the fiducial 
limits of a ratio. The procedure in setting up fiducial limits of a ratio 
between quantities having normally distributed estimates has been given 
by R. A. Fisher in his book, The Design of Experiments (Fourth Edition). 
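
Limits for a ratio of two normally distributed estimates are commonly obtained as the roots of a quadratic, a result usually associated with Fieller. The sketch below uses that common form under the added assumption that the two estimates are independent; whether it agrees term for term with Fisher's presentation is an assumption, and the function name, argument names, and numerical values are hypothetical.

```python
from math import sqrt


def ratio_limits(a, b, var_a, var_b, t):
    """Limits for the ratio alpha/beta from independent estimates a and b.

    The limits are the roots in rho of
        (a - rho*b)**2 = t**2 * (var_a + rho**2 * var_b),
    where t is the tabled critical value of Student's t (supplied by
    the caller). Finite limits exist only when b differs
    significantly from zero.
    """
    lead = b * b - t * t * var_b
    if lead <= 0:
        raise ValueError("denominator estimate not significantly different from zero")
    disc = (a * b) ** 2 - lead * (a * a - t * t * var_a)
    root = sqrt(disc)
    return (a * b - root) / lead, (a * b + root) / lead


# Hypothetical case: experimental mean 10 (variance 1), control mean 5
# treated as known exactly, and t = 2.
lower, upper = ratio_limits(a=10.0, b=5.0, var_a=1.0, var_b=0.0, t=2.0)
```

With the control mean treated as exact, the limits reduce to (10 ± 2)/5, that is, 1.6 and 2.4; with a nonzero variance for the denominator the interval becomes asymmetric about the observed ratio.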

Other contributions to distribution theory were made by Aroian (6) con- 
cerning the probability function of the product of two normally distrib- 
uted variables, by Hsu (62) on the asymptotic distributions of certain 
statistics, by Camp (23) on the effect on a distribution function of small 
changes in the population function, by Chung (24) on the maximum 
partial sum of independent random variables, and by Bhattacharyya (15) 
dealing with the distribution of the sum of Chi-squares. Bartlett (9) 
showed that orthodox probability theory may be extended to include 
probability numbers outside the conventional range. 


Sampling Theory 


While it can be stated broadly that no essentially new methods of 
selecting representative samples have been developed in recent years, 
many additions to our knowledge of the underlying theory of the various 
methods of sampling have been made. Furthermore, much development 
has taken place in technics for estimating the sampling errors of both 
simple and complicated sampling designs. Yates (115) published a very 
comprehensive and informative paper dealing with the developments and 
applications to problems in area sampling and in economic and social 
surveys. The intensive use of sampling during the war both in economic 
and social surveys, and in operational research led to development of 
methods of sampling for handling materials at times varying greatly 
from part to part and of methods of estimating sampling errors when 
sampling units of widely different sizes were used. However, when the 
discrepancies in size are marked, estimates established on ratios or 
percents are generally required, which of necessity complicates the estimation 
of the errors of sampling. Random sampling (without restrictions), 
stratified sampling (random sampling from groups), subsampling, strati- 
fication for two or more factors, and balancing are the principal sampling 
methods used. All these methods involve random selection of the sampling 
units and thereby provide exact estimates of sampling error. The analysis 
of variance, which makes possible the collating of estimates of error and the 
separation of components of error not strictly homogeneous, has pro- 
vided the basis for development of the various systematic methods of 
sampling. The principle of randomization which made possible the de- 
velopment of the various modern designs of experiments correspondingly 
made possible the development of the relatively complicated schemes 
of sampling by providing a valid estimate of error in virtue of the random 
elements in the process of selecting the sample. 

A good illustration of the modern theory and practice of sample designs 
was presented in A Chapter on Population Sampling (103). This mono- 
graph by the Bureau of the Census Sampling Staff deals mainly with 
the theory underlying a particular approach to areal sampling with 
subsampling and demonstrates the use of the “principle of optimum al- 
location by which the smallest sampling error is obtained for a specified 
expenditure.” Formulas were developed for estimating the total population, 
the variance of this estimate in terms of population parameter, expectations 
of the sample variances, and approximations to these expectations. 
Other illustrations of the problems and methods of sampling surveys 
were given by Hansen, Hurwitz, and Gurney (54), Hansen and Hauser 
(53), Jessen (64), and Mahalanobis (75). 
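
The principle of optimum allocation quoted above can be illustrated in the simpler setting of one-stage stratified sampling, where the textbook rule makes the sample size in each stratum proportional to the stratum size times its standard deviation, divided by the square root of the unit cost. This is a simplified stand-in, not the areal subsampling formulas of the monograph; the function name and all figures are hypothetical.

```python
from math import sqrt


def optimum_allocation(sizes, std_devs, unit_costs, budget):
    """Stratum sample sizes minimizing variance for a fixed total cost.

    sizes:      number of units in each stratum (N_h)
    std_devs:   standard deviation within each stratum (S_h)
    unit_costs: cost of sampling one unit in each stratum (c_h)
    budget:     total expenditure available, sum of c_h * n_h
    Returns unrounded n_h proportional to N_h * S_h / sqrt(c_h).
    """
    weights = [n * s / sqrt(c) for n, s, c in zip(sizes, std_devs, unit_costs)]
    cost_of_weights = sum(n * s * sqrt(c) for n, s, c in zip(sizes, std_devs, unit_costs))
    return [budget * w / cost_of_weights for w in weights]


# Hypothetical strata of equal size and cost; the second is three
# times as variable and so receives three times the sample.
allocation = optimum_allocation([100, 100], [1.0, 3.0], [1.0, 1.0], 40.0)
```

In practice the resulting sizes would be rounded and capped at the stratum sizes, which this sketch does not attempt.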

Cochran (25) studied the relative accuracy of systematic and stratified 
random sampling in a type of population in which the variance within 
a group of elements increases steadily as the size of the group increases. 
The stratified random sample was found to be always at least as accurate 
on the average as the random sample with its relative efficiency represented 
by a monotonic increasing function of sample size. No general result was 
found valid for the relative efficiency of the systematic sample. System- 
atic sampling in certain populations of the class of populations specified 
was more precise than stratified sampling for one sampling rate, but was 
less accurate than random sampling for a different sampling rate. However, 
when the correlogram was in addition concave upwards, the systematic 
sample was on the average more accurate than the stratified sample
irrespective of sample size. Other comparative studies of sampling methods 
were made by Madow (74) and by Brown (22). 
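
A comparison of the three designs can be sketched by exact enumeration. The population used here, a simple linear trend, is an assumption chosen for illustration and is not the class of populations Cochran studied. For this population the systematic sample falls between the stratified and the random sample in accuracy, which agrees with the remark that no general result holds for systematic sampling. The function name is hypothetical.

```python
from statistics import mean, pvariance, variance

def sampling_variances(y, n):
    """Exact variances of the sample mean for a population list y of size N = n*k
    under three designs: simple random, stratified (one unit per stratum of k
    consecutive units), and systematic (every k-th unit, random start)."""
    N = len(y)
    k = N // n
    assert n * k == N, "population size must be a multiple of n"
    # Simple random sampling without replacement: (1 - n/N) * S^2 / n.
    v_random = (1 - n / N) * variance(y) / n
    # Stratified: independent single draws from each block of k units.
    strata = [y[i * k:(i + 1) * k] for i in range(n)]
    v_stratified = sum(pvariance(s) for s in strata) / n ** 2
    # Systematic: only k possible samples, one for each starting point.
    systematic_means = [mean(y[start::k]) for start in range(k)]
    v_systematic = pvariance(systematic_means)
    return v_random, v_stratified, v_systematic

population = list(range(1, 101))   # assumed linear-trend population
v_ran, v_str, v_sys = sampling_variances(population, n=10)
print(v_ran, v_str, v_sys)
```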

Gumbel (49) studied the conditions of independence of the extremes 
in a sample. Thurstone (98) considered the theory of univariate and multi- 
variate selection in factor analysis, making a comparison of the psycho- 
logical interpretations of the resulting factors. Baer (7) contributed 
to the problem of sampling from a changing population. Cases arise 
where it is not possible to take more than one sample at any given time 
and if the population changes between successive samples, the problem 
arises of estimating from a random sampling of the original population 
certain parameters of a family of populations. The stochastic limits for 
the mean, variance, and certain other statistics of the sample were deter- 
mined. Molina (78) presented a method of analysis and presentation 
for use in estimating a population fraction from a sample fraction, which 
utilizes both statistical data and collateral information. 

To the great fundamental advances in the theory and technics of sam- 
pling, most workers in educational and psychological research seem to re- 
main completely indifferent. Only two publications bearing on sampling 
problems in these fields were located. Cornell (29) applied the method of 
stratified sampling to a survey of higher education enrolment for the pur- 
pose of securing an unbiased estimate of the total enrolments of various 
types of students in six major classes of higher educational institutions. 
Marks (77) pointed out that it was difficult to find any study in psycho- 
logical research that used sampling designs which made possible the valid 
estimate of sampling error. He used the description of the sampling meth- 
ods in the revision of the Stanford-Binet Scale to illustrate how the de- 
sign could not yield statistics with measurable standard errors. Marks also 
treated the technical problem of cluster sampling in psychological re- 
search showing how cluster sampling almost always increases sampling 
error as compared with unrestricted sampling error of the same num- 
ber of cases. This is attributable to the practice of sampling previously 
existing groups of the population which involves a positive intraclass 
correlation of the variable under investigation. Since cluster sampling is an 
extremely valuable method of sampling in psychological research it is 
essential to know the conditions and means by which statistical estimates 
and the measures of their sampling error may be accurately made. 
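
The effect of a positive intraclass correlation on the sampling error of a cluster sample can be sketched with the familiar approximation 1 + (m - 1)r for clusters of equal size m and intraclass correlation r. This formula is a standard approximation supplied here for illustration and is not taken from Marks's paper; the function names and the figures are hypothetical.

```python
def design_effect(cluster_size, intraclass_corr):
    """Approximate variance inflation of a mean from equal-sized clusters
    relative to an unrestricted random sample of the same number of cases."""
    return 1 + (cluster_size - 1) * intraclass_corr

def effective_sample_size(n_cases, cluster_size, intraclass_corr):
    """Number of independently sampled cases giving the same precision."""
    return n_cases / design_effect(cluster_size, intraclass_corr)

# Hypothetical survey: 40 classrooms of 30 pupils, intraclass correlation 0.10.
deff = design_effect(30, 0.10)
print(deff)
print(effective_sample_size(40 * 30, 30, 0.10))
```

With these assumed figures the 1200 pupils yield about the precision of 308 pupils drawn independently, which is why the standard error formulas for unrestricted sampling understate the error of such a design.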

The need is urgent for thoro studies of the efficiency of different 
sampling procedures for the different types of problems in educational 
research. 


Test of Statistical Hypotheses 


Suppose that a random variable X is the measurement of a certain 
character and that a number of repeated measurements are carried out, say 
N times. We then secure N random variables: $X_1, X_2, \ldots, X_N$. It is
assumed that the N random variables are independently distributed and the
set of values comprises a sample of N independent observations on X.
Assume that the probability distribution of X is normal but that the values
of some parameters, $\theta_1, \ldots, \theta_k$, specifying the population are unknown.
Any assumption about the unknown parameters, $\theta_1, \ldots, \theta_k$, may be
called a statistical hypothesis. Situations are often met in research in which 
hypotheses are advanced regarding the properties of the probability distri- 
butions of certain variables and we wish to test whether available statistical 
data are compatible with the hypotheses or not. A test of this general char- 
acter is called a test of significance relative to the hypothesis under consider- 
ation. The hypothesis that chance factors may have given rise to an ob- 
served effect is frequently called the null hypothesis, an hypothesis which is 
encountered in a number of different forms in research or statistical work. 
As mentioned earlier the testing of statistical hypotheses is one of the 
fundamental problems of statistical inference. The studies dealing with this 
problem are presented in five categories. 

General theory of testing statistical hypotheses—Fisher (43) discussed 
the process of reasoning involved in using observations as a basis for 
making probability statements about parameters, the knowledge about 
which is derived only from such observations. In contrast to his interpreta- 
tion of statistical tests there is the interpretation built on the frequency 
concept that the level of significance in significance tests is equal to the 
frequency with which the hypothesis under test is rejected in repeated 
sampling of any fixed population permissible by hypothesis. The latter 
interpretation is basic in the Neyman and Pearson approach in dealing 
with tests of statistical hypotheses. Pearson (83) presented as an illustration
of this interpretation the problem of testing the significance of a difference
between two proportions in a 2 × 2 table. There may be two possible
outcomes of a test of an hypothesis, $H_0$: either the hypothesis is rejected
or it is accepted. Since there are two different actions with respect to testing
$H_0$, there are two different kinds of errors of judgment: (a) errors of the
first kind consist in an unjust rejection of $H_0$, and (b) errors of the second
kind consist in the failure to reject $H_0$ when, in fact, it is incorrect; that
is, when some alternative hypothesis is true. To know the properties of a 
statistical test is to know the probabilities of the two kinds of error that may 
be committed in the application of the test. The property of the test associ- 
ated with the control of the second type of error is designated as the power 
of the test. The power function is also of great value in comparing the 
relative merits of alternative tests of the same hypothesis. Hsu (61) dealt 
with the power function of the $E^2$-test and the $T^2$-test. Johnson and Hoyt
(65) extended the utility of the Johnson-Neyman concept of the “region 
of significance” by formulating the theoretical basis and illustrating useful 
results for a three-dimensional region. Welch (108) generalized “Student’s”
problem to include the case of testing the significance of the difference of
means of populations with unequal variances. Wilks (112) developed test 
criteria by the Neyman-Pearson method of likelihood ratios for testing 
simultaneously the equality of means, equality of variances, and equality
of covariances in a normal multivariate population of K variables on the 
basis of a sample. An application of these criteria was made in testing the
hypothesis that three “parallel forms” of a test are really parallel in the
sense that their means, variances, and covariances are the same. Grant (47)
illustrated the use of certain statistical tests, including the criteria of
randomness, in experiments on learning and problem solution in psychology.
Johnson and Tsao (66) developed a test of the homogeneity of a set of 
variances when correlation exists between the means and standard devia- 
tions. This test is of special value in analyzing the effect of increasing age 
on variability of traits. David (35) developed a chi-square “smooth” test for
goodness of fit which takes into account the sign of the deviations of 
observations from hypothesis and the order of these signs. 
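
The two kinds of error and the power function described above can be illustrated by a minimal sketch. It assumes the simplest possible case, a one-sided test of a normal mean with known standard deviation, which is not one of the tests treated in the studies cited. The function name and the numerical settings are hypothetical.

```python
from statistics import NormalDist

def power_one_sided_z(delta, sigma, n, alpha=0.05):
    """Power of the one-sided z-test of H0: mean = 0 against mean = delta > 0,
    with known standard deviation sigma and n independent observations."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha)       # reject when z exceeds this
    shift = delta * n ** 0.5 / sigma               # mean of z under the alternative
    return 1 - std_normal.cdf(critical - shift)

alpha = 0.05
for delta in (0.0, 0.2, 0.5):
    power = power_one_sided_z(delta, sigma=1.0, n=25, alpha=alpha)
    # Columns: true mean, power, probability of an error of the second kind.
    print(delta, round(power, 3), round(1 - power, 3))
```

When the true mean is zero the rejection rate equals the level of significance, which is the probability of an error of the first kind; for any other true mean, one minus the power is the probability of an error of the second kind.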

Other studies treated the test of significance for intraclass correlation 
when family sizes are unequal (14), the use of the range in the t-test (73), 
the use of ranking methods in testing the significance of differences in 
individual comparisons (111), operating characteristics (40), the use 
of the statistical sign test based on the differences between pairs of observa- 
tions, with a table of significance levels (36), the principle of likelihood 
for testing a broad class of statistical hypotheses (59), the significance 
of trend differences (73), and the effect of intraclass correlation on certain 
significance tests (107). Useful tables have been prepared which provide 
the means of making the most efficient test now available for testing the 
homogeneity of a set of estimated variances (96). 

Sequential tests—An important extension of the Neyman and Pearson 
ideas on testing hypotheses was evolved by Wald (106) and his associates 
(27). The method of sequential tests of a statistical hypothesis, applied
as yet principally to the analysis of industrial products, consists in the
application of a sequential probability ratio to the observations taken one
at a time. After each observation one of the following actions is taken: 
(a) the lot or process under examination, or the statistical hypothesis 
under test, is accepted, (b) it is rejected, (c) judgment is suspended and 
another observation is taken. In cases (a) and (b) no further observation 
is taken. The claim made for this test, where it is applicable, as against 
the classical tests, which use a sample number fixed in advance of the 
experiment, is that it can satisfy the conditions of the test with a smaller
average sample size, often as much as 50 percent less. Sequential procedures
are based upon conditions of randomization, or upon conditions in which the
sequence of the results upon inspecting the items is a random sequence.
The actual procedures in conducting the experiment determine the test to be used. It is
worthy of note that in this process the role of statistical analysis is in 
the very process of experimentation itself as compared with its considera- 
tion at the time of designing the experiment, characteristic of modern 
experimental designs. 
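
The three-way decision taken after each observation can be sketched for the simplest case, a test between two specified values of a proportion. The boundaries used are Wald's approximations, log((1 - beta)/alpha) and log(beta/(1 - alpha)); the function name, the proportions, and the inspection records are hypothetical.

```python
from math import log

def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.05):
    """Sequential probability ratio test of H0: p = p0 against H1: p = p1 (> p0)
    for 0/1 observations, using Wald's approximate boundaries."""
    upper = log((1 - beta) / alpha)    # cross this: reject H0
    lower = log(beta / (1 - alpha))    # cross this: accept H0
    llr = 0.0                          # running log likelihood ratio
    for count, x in enumerate(observations, start=1):
        llr += log(p1 / p0) if x else log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "reject H0", count
        if llr <= lower:
            return "accept H0", count
    return "continue sampling", len(observations)

# Hypothetical inspection records: 1 = defective item, 0 = good item.
print(sprt_bernoulli([0] * 50, p0=0.05, p1=0.20))
print(sprt_bernoulli([1, 0, 1, 1, 0, 1], p0=0.05, p1=0.20))
```

The number of observations used is itself a random quantity, which is why the saving is stated in terms of the average sample size.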

A special case of sequential analysis was treated by Stein (94), who 
developed a two-sample test of a linear hypothesis, the power of which is 
independent of the variance. Cowden (30) provided an illustration of the
application of sequential sampling in testing the achievement of students 
in a class in statistics. 

Rank correlation—Since rank correlation is a useful technic for the 
analysis of certain kinds of data arising in educational research, investiga- 
tions of this technic are of importance to educational research workers. 
An increasing amount of attention is being given to problems of statistical 
inference which are referred to as nonparametric problems; that is, problems
in which the cumulative distribution function of the population may
be held to be continuous but in which the function is arbitrary within 
a broad class. In these problems, order statistics (ranked sets of values
in a random sample from lowest to highest values) come into prominent 
use. The rank-correlation coefficient is an example of a nonparametric 
test for two dimensions, based on the method of randomization. 

Daniels and Kendall (34) considered the problem of setting up confi- 
dence intervals for a rank-correlation coefficient in a correlated population 
and developed a test of significance for the difference between two
rank-correlation coefficients. Hoffding (57) showed that the sampling
distribution of Kendall’s (1938) measure of rank correlation, $\tau$, tends to
normality as $n \to \infty$ for any population with continuously distributed X
and Y if a certain condition holds. Sillitto (92) treated the problem of the
probability distribution of Kendall’s $\tau$ for cases in which ties occur in one of the two
rankings. He also constructed a table for use in showing the probability 
of attaining or exceeding an observed score value (S) by chance where 
paired or triplet ties occur. Whitefield (110) extended the results of
Kendall (70) on tied rankings to the determination
of the relation between two variables, one expressed as a ranking and the
other as a dichotomy. This problem is often encountered in determining
the relationship of a psychological measurement and an external criterion. 
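
The score S and Kendall's coefficient mentioned above can be computed directly by counting pairs, as in the following minimal sketch. It assumes untied rankings; the corrections for ties studied by Sillitto and Whitefield are not implemented, and tied pairs would simply contribute nothing to S. The function name and the ranks are hypothetical.

```python
from itertools import combinations

def kendall_score_and_tau(x, y):
    """Return Kendall's score S (concordant minus discordant pairs) and
    tau = S / (n(n-1)/2) for two untied sequences of equal length."""
    n = len(x)
    s = 0
    for i, j in combinations(range(n), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            s += 1      # pair ranked in the same order on both variables
        elif product < 0:
            s -= 1      # pair ranked in opposite orders
    return s, s / (n * (n - 1) / 2)

# Hypothetical ranks of six pupils on two tests.
test_a = [1, 2, 3, 4, 5, 6]
test_b = [2, 1, 4, 3, 6, 5]
print(kendall_score_and_tau(test_a, test_b))
```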

Analysis of variance and covariance—The analysis of variance technic 
developed by R. A. Fisher twenty-five years ago in connection with certain 
experimental designs in agriculture is the principal research tool of the 
biologist and is being increasingly used in the physical sciences, in engi- 
neering, and in the social sciences. Apparently its great powers are just 
beginning to be exploited. The principal purposes of the analysis of 
variance are: (a) to obtain efficient estimates of certain treatment differ- 
ences of interest to the experimenter, (b) to obtain a measure of the 
degree of confidence to be placed in the obtained estimates by means of 
estimated standard errors, fiducial limits, or confidence intervals, (c) to 
carry out tests of significance that are valid and sensitive. For the intelligent
and efficient use of any statistical tool the research worker needs to
know the assumptions underlying its proper use and how to test them.
The conditions have not often been made clear in the textbooks or manuals, 
and it is not often that reports of researches employing analysis of variance 
procedures have provided any evidence as to whether or not their use was 
justified in the particular situation. Three useful papers, recently published, 
have dealt with the conditions underlying the efficient use of the analysis
of variance technic. Eisenhart (37) enumerated the several assumptions 
underlying the analysis of variance and treated the practical importance 
of each. Cochran (26) pointed out the consequences when certain assump- 
tions are not satisfied and gave important information on means of detect- 
ing the failure of the assumptions as well as on how to avoid the more 
serious consequences. Bartlett (10) treated the problem of transforming 
the original data by changing the scale of measurement in order to make 
statistical analysis more valid. He gave particular consideration to the use 
of transformations, such as the square root, the logarithmic, and the 
inverse sine transformation in applications of analysis of variance. An- 
other important paper by Bartlett and Kendall (11) dealt with the 
logarithmic transformation in analyzing heterogeneous variances. Irwin 
(63) treated the problem of interpreting the within and between class 
analysis of variance when the intraclass correlation is negative. 
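
The within and between class analysis referred to above can be sketched for a one-way classification. The sketch only computes the F ratio; it does not check the assumptions (independence, normality, homogeneous variances) discussed by Eisenhart, Cochran, and Bartlett, and the function name and scores are hypothetical.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a one-way classification.
    Groups may contain unequal numbers of observations."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

# Hypothetical scores under three teaching methods.
scores = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
print(one_way_anova(scores))
```

The F ratio would then be referred to tables of the F distribution with the two degrees of freedom returned.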

The use of the analysis of variance would be greatly limited, particularly 
in education and in the social sciences generally, if, as was the original 
case, equal or proportionate numbers of observations were required
in the subclasses. Several important recent papers have dealt with the 
problem of unequal representation. Tsao (101) provided a mathematical 
solution for the general problem of the analysis of variance and covariance 
where there is unequal representation in the subclasses. He also presented 
certain approximate methods of practical use to research workers. Other 
contributions to this problem were made by Ansbacher and Mather (5), 
by Baten (12), by Hazel (55), by Patterson (82), and Tsao and Johnson 
(102). 

Fisher (44) gave a simple solution by the analysis of covariance method 
to the often puzzling problem of the relation between a part and the whole. 
Finney (41) showed how the precision of mean comparisons can often be 
increased by the application of analysis of covariance. 

Multivariate analysis—Modern multivariate statistical theory has given 
rise to new exact tests of statistical hypotheses in terms of probability, 
which may involve extensive multiple measurements. In 1936 Fisher intro- 
duced the discriminant function by which can be solved the problem of 
specifying an individual as a member of one of many populations or the 
classification of a number of populations based on the configuration of 
various characteristics. Mahalanobis’ generalized distance function, $D^2$,
can be used to measure the “distance” between sets of multiple measure- 
ments, for instance, to determine whether any two members of a constel- 
lation are closer to one another than any two belonging to different con- 
stellations. Hotelling’s generalized “Student’s Ratio” is a powerful tool 
for the discrimination of mean values between multivariate normal popula- 
tions (on the hypothesis of equal variances and equal covariances). 
Contributions to the theoretical or practical solution of problems in this 
area were made by Von Mises (105), Radhakrishna (89), Bose (19), 
Yardi (114), and Beall (13). 
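
The generalized distance and its relation to the generalized “Student’s Ratio” can be sketched for the case of two variables, where the covariance matrix can be inverted by hand. The relation T² = [n_a n_b / (n_a + n_b)] D² used below is the usual two-sample form, assumed here for illustration; the function names and all figures are hypothetical.

```python
def mahalanobis_d2(mean_a, mean_b, cov):
    """Generalized distance D^2 = d' S^{-1} d between two mean vectors of two
    variables, where cov is the common 2 x 2 covariance matrix."""
    d0, d1 = mean_a[0] - mean_b[0], mean_a[1] - mean_b[1]
    (s00, s01), (s10, s11) = cov
    det = s00 * s11 - s01 * s10
    # Explicit inverse of a 2 x 2 matrix applied to the difference vector.
    return (s11 * d0 * d0 - (s01 + s10) * d0 * d1 + s00 * d1 * d1) / det

def hotelling_t2(d2, n_a, n_b):
    """Two-sample T^2 corresponding to D^2 computed from the pooled covariance."""
    return n_a * n_b / (n_a + n_b) * d2

# Hypothetical group means on two tests and a pooled covariance matrix.
d2 = mahalanobis_d2([52.0, 48.0], [50.0, 45.0], [[4.0, 2.0], [2.0, 9.0]])
print(d2, hotelling_t2(d2, n_a=20, n_b=30))
```
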

The noncentral Wishart distribution is the joint distribution of the
sum of squares and crossproducts of the deviations from the sample means
when the observational values originate from a set of normal multivariate
populations with stable covariance matrix but with expected values varying
from observation to observation. This distribution is the basis of obtaining 
the power function for many statistical tests in multivariate normal sta- 
tistics. Anderson (4) applied the noncentral distribution to obtain the 
moments of the generalized variance and the moments of the criterion 
for linear hypotheses when the population means lie on a line or a plane. 
Guttman (50) evolved a new approach to paired comparison and rank 
order. 

Other theoretical or applied problems involving multivariate analysis 
were presented by Jones (67), Tintner (100), Wherry and Taylor (109), 
Bittner (17), and Brogden (21). 


Theory of Statistical Estimation 


The problem of estimation is one of the cornerstones of statistical theory. 
The theory of statistical estimation treats the problem of estimating values 
(statistics) of the unknown parameters of distribution functions of specified 
mathematical form from random samples assumed to have been taken 
from such populations. The most important general method of estimation 
is the method of maximum likelihood. Interval estimation is given either 
by fiducial limits or confidence intervals. Halmos (52) discussed a neces- 
sary and sufficient condition for the existence of an unbiased estimate and 
illustrated by application to the moments of a distribution. Bhattacharyya 
(16) attempted to find a lower bound of variances of estimates of a given 
function of the parameters which is independent of the estimates used. 
Vatnsdal (104) found the position of the point about which the variance 
of the second and of the third moments is a minimum. Smith (93) con- 
tributed to the theory of estimating linear functions of cell proportions. 
Pillai (84) studied different methods of setting up the confidence limits 
for the correlation coefficient. Girshick, Mosteller, and Savage (46) pre- 
sented some theorems with applications dealing with unbiased estimation 
of the parameter P (fraction defective) for samples drawn from a binomial
distribution. Two studies (8, 48) dealt with relations between range and 
standard deviation. Aiken (2) derived linear approximations by least 
squares making use of the properties of the variance matrix. Winsor (113) 
developed a general principle that when possible the experiment should be 
designed so that the desired regression function can be determined directly;
where this is not possible the inverted regression should be used. 

December 1948 RECENT DEVELOPMENTS IN STATISTICAL THEORY

Mosteller (80) proposed a number of statistical technics for the economical
analysis of large masses of data by means of punched-card equipment.
The principal technic is the use of functions of order statistics which
promises to provide a simple and effective practical method for estimating
parameters of normal and other populations which have a continuous cumulative
distribution function. Scheffe and Tukey (91) validated the existing
solutions in the nonparametric case for setting up confidence intervals 
for an unknown quantile and population tolerance limits. The assumption 
involved only a continuous cumulative distribution function. 
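
A nonparametric confidence interval of the kind referred to above can be sketched for the median, the simplest quantile. The interval between the r-th smallest and r-th largest observations covers the median with a probability given by the binomial distribution with p = 1/2, whatever the continuous distribution sampled. The function names and the sample are hypothetical, and the sketch is not the general solution of Scheffe and Tukey.

```python
from math import comb

def median_interval_coverage(n, r):
    """Exact confidence coefficient of the interval (X_(r), X_(n-r+1)) for the
    median of any continuous distribution, X_(i) being the i-th order statistic."""
    tail = sum(comb(n, i) for i in range(r)) / 2 ** n
    return 1 - 2 * tail

def median_interval(sample, r):
    """Return the r-th smallest and r-th largest observations."""
    ordered = sorted(sample)
    return ordered[r - 1], ordered[-r]

# Hypothetical sample of ten scores.
scores = [12, 15, 9, 21, 17, 14, 11, 19, 16, 13]
print(median_interval(scores, r=2), median_interval_coverage(10, r=2))
```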

The analysis of variance technic has been used chiefly in making tests 
of significance. Another significant use, not very well known, is in 
dealing with problems of estimation, that is, in the detection and estimation 
of components of random variation associated with a composite popula- 
tion (37). In problems of this kind the parameters involved are variances, 
the absolute and relative magnitudes of which are of chief importance. 
Crump (33) has treated very thoroly the problems involved in this use 
of the analysis of variance technic. Satterthwaite (90) developed an ap- 
proximate distribution of the estimate of variance components, based on 
the Chi-square distribution. 
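
The use of the analysis of variance for estimating components of variation can be sketched for a balanced one-way classification. The sketch assumes equal numbers in every group and the usual random-effects model, under which the expected mean square between groups is the within-group variance plus n times the between-group variance. The function name and the data are hypothetical.

```python
from statistics import mean

def variance_components(groups):
    """Estimate within-group and between-group variance components from a
    balanced one-way classification (equal numbers in every group)."""
    k, n = len(groups), len(groups[0])
    assert all(len(g) == n for g in groups), "balanced data assumed"
    grand_mean = mean(v for g in groups for v in g)
    ms_between = n * sum((mean(g) - grand_mean) ** 2 for g in groups) / (k - 1)
    ms_within = sum((v - mean(g)) ** 2 for g in groups for v in g) / (k * (n - 1))
    # E[MS_between] = sigma_w^2 + n * sigma_b^2, hence the subtraction below.
    sigma_between = (ms_between - ms_within) / n
    return ms_within, sigma_between

# Hypothetical scores of three pupils in each of three schools.
data = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
print(variance_components(data))
```

The between-group estimate can come out negative in small samples, a difficulty the sketch leaves untreated.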

Stevens (95) discussed scales of measurement with special consideration 
to psychological measurement under three phases: (a) the various rules 
for the assignment of number, (b) the mathematical properties (or group 
structure) of the resulting scales, and (c) the statistical operations appro- 
priate to measurements performed with each scale type. Cornell (28) 
discussed the characteristics of the major types of apportionment formulas. 
Mann (76) treated a problem of estimation occurring in public-opinion 
polls. He showed that the variance of the estimate in public-opinion polls 
is somewhat larger than the variance in random sampling because a cluster 
and not a random sample is used. 


Design and Analysis of Statistical Investigations 


The principles of the design of experiments were laid down by R. A. 
Fisher, and by 1926 the essentials of good experimental design and analysis 
were determined: replication, randomization, and control of variability. 
The analysis of variance technic was a simultaneous development. This 
technic supplies the appropriate method of estimating the experimental 
error and of carrying out the exact tests of significance. It is a commonly 
accepted statistical principle that the valid interpretation of a body of 
data requires a knowledge of how the data were obtained. Equally it is 
understood by the modern research worker that the conclusions drawn 
from experimental results must be based on a knowledge of experimental 
principles at all stages of an experiment. The most efficient method of 
analysis can be employed and the greatest precision secured only if the 
experiment is planned with this end in view. The principles laid down by 
Fisher have stood the test of time and are being used successfully in most 
experimental sciences. There has been some change in emphasis during 
the last two decades away from a tendency to overemphasize the importance 
of tests of significance to more emphasis on the estimates of the effects of 
treatments and the measurement of their experimental errors. New types of
design have been developed, many of which are connected with the
factorial design, which includes all combinations of a set of factors
in the same experiment.
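
The layout of a factorial experiment, with replication and a randomized order of runs, can be sketched as follows. The factors, levels, and function name are hypothetical, and the sketch covers only the enumeration of treatment combinations, not blocking or confounding.

```python
import random
from itertools import product

def factorial_design(factors, replicates=1, seed=None):
    """List every combination of factor levels, repeated `replicates` times,
    in a randomized run order."""
    names = list(factors)
    runs = [dict(zip(names, levels))
            for levels in product(*(factors[name] for name in names))
            for _ in range(replicates)]
    random.Random(seed).shuffle(runs)
    return runs

# Hypothetical 2 x 3 classroom experiment, each combination used twice.
plan = factorial_design(
    {"method": ["lecture", "discussion"], "class_size": [20, 30, 40]},
    replicates=2, seed=1)
print(len(plan))
for run in plan[:3]:
    print(run)
```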

In spite of these developments the prototype of educational experiments 
is the single factor experiment presumably based on the doctrinaire theory 
that an experiment must be simple and apply the so-called “law of the 
single variable.” The modern principles of experimental design would be 
of special importance in large-scale cooperative experiments so much 
needed in education. The importance of planning is clearly indicated in 
the case of experiments or other observational programs, such as longitudi- 
nal studies running over periods of years. Major changes in such investi- 
gations are usually impossible after they are underway and serious mistakes 
in designs would necessitate the scrapping of the whole investigation. 

Space permits only a few studies dealing with experimental design and 
analysis to be reported. Nandi (81) considered the problem of estimating 
linear functions of unknown parameters and testing various hypotheses 
relating to them. He gave the analysis of variance of Split-Plot and Strip- 
Arrangement Designs showing that the estimates of experimental errors 
were different according to the hypotheses tested about different sets of 
parameters. Bose (20) formulated methods of attacking the problems of 
balancing and partial confounding and illustrated the actual working out 
of these processes. He also generalized certain recent results of Fisher 
with respect to the maximum number of factors that can be accommodated 
in a symmetrical factorial experiment subject to the conditions that no 
main effect or two factor interaction is confounded. Kishen (71) provided 
a comprehensive general solution for the design of experiments for weigh- 
ing and making other types of measurements. Plackett and Burman (86) 
worked out the designs for optimum multifactorial experiments in physical 
or industrial research. Plackett (85) educed certain generalizations in 
multifactorial designs. Anderson (3) reviewed the various contributions 
that have been made to the problem of missing-plot technics and derived 
some formulas for missing plots in split-plot experiments by minimizing 
the error variance. There is considerable need in educational research for 
developing technics for handling the many situations which arise where
data are incomplete for certain experimental subjects. Haldane (51) 
presented certain simple designs based on logical rather than experimental 
principles for the study of the interaction of nature and nurture which is 
one of the main problems in genetics. 


CHAPTER VIII


Computational Technics 


NICHOLAS A. FATTU 


The flood of computational demands brought on by the war quickly
swamped available equipment and computers and lent considerable impetus 
to the development of more rapid computational devices. War-time restric- 
tions kept information about these devices out of circulation until recently, 
but in 1947 a host of articles began appearing in engineering and applied 
science journals, and in October 1947 the journal Mathematical Tables 
and Other Aids to Computation began a new indexing section, “Automatic 
Computing Machinery.” 

Material published during the past three years on computational technics 
may be classified in terms of its relation to high-speed automatic
computers, mechanical computers, tabular and graphical devices, and variations
in formulas and technics. Most of these devices and changes, except the
automatic computers, are familiar to educational research workers. 


General Bibliographies 


Besides the indexes available in educational and psychological journals, 
bibliographies and summaries may be found in Clark (27); George (55);
Murray (108); the Fletcher, Miller, and Rosenhead index (51); and the
Harvard Computation Laboratory Manual (75); in the Journal of the
American Statistical Association, “Statistical Methodology Index” (22);
the British Journal of Psychology, “Statistical Section”; Mathematical
Reviews; Mathematical Tables and Other Aids to Computation; and the
International Business Machines Corporation Bibliography (81).


High-Speed Automatic Computers 


New mechanical and electronic computers, developed since 1942 largely 
under the pressure of the war, handle information with great speed and 
skill no matter how long the routine may be. Two differences may be ob- 
served between these machines and conventional equipment: (a) data and 
the entire routine for solving a problem are put into the device and it 
automatically carries out the operations printing the answers as it obtains 
them, (b) information can be transferred automatically from any part 
of the machine to any other part. For example, the machine can be fed 
tables required for certain sequences in the calculation, and at the proper 
times the device will almost instantly refer to the required entry and use 
it in solving the problem. General discussions of these machines were pub- 
lished by Alt (2, 4), Archibald (6), Berkeley (13), Burks, Goldstine, and 
Von Neumann (21), Comrie (29), Duncan (41), Hartree (66, 68, 69), 
Peterson and Concordia (112), and Stibitz (121, 122). 

At least five types of automatic computers were in operation during the
war: the Bush Differential Analyzer, the Electronic Numerical Integrator 
and Computer (ENIAC), the IBM Automatic Sequence Controlled Cal- 
culator, the Bell Telephone Laboratories Relay Computer, and the Elec- 
tronic Discrete Variable Computer (EDVAC). 

The Differential Analyzer (5, 13, 23, 67, 112) produced under the 
direction of Vannevar Bush handles with uncanny accuracy addition, sub- 
traction, multiplication, division, integration, and the like as continuous 
operations by means of eighteen integrating devices and 130 mechani- 
cally coupled rotating shafts. Initial values and instructions are fed into 
the machine as punched holes on a paper tape, and answers come out 
typed by electric typewriters and/or as graphs. It performs integrations 
directly, works differential equations which cannot be solved by direct 
means, and gives numerical answers rapidly to five-decimal place ac- 
curacy. 

The Sequence Controlled Calculator (1, 13, 66, 67) was constructed 
thru the cooperative efforts of IBM engineers and Professor Aiken of 
Harvard. A series of ten-position relays, similar to counter wheels, handle 
the digits zero to nine. The counter positions are connected in banks of 
twenty-four so that numbers of twenty-three digits can be handled. Num- 
bers go into the machine by feeding punched cards, setting hand switches, 
or as tables punched on long paper tapes. Instructions are put in as a long 
sequence of rows of punched holes on an endless paper tape called the 
sequence control tape. Results come out punched on cards or typed by 
electric typewriters. The Harvard Computation Laboratory Manual (75) 
gives instructions for coding data for the machine. 

The ENIAC (4, 13, 56, 66, 68, 109) represents digits by pairs of rows 
of ten vacuum tubes each. In one row of the pair a tube corresponding 
to the digit recorded is turned on. In the other row all the tubes are turned 
on except the one corresponding to the recorded digit. To add unity a 
single impulse is sent thru which extinguishes the lighted tube in the first 
row and lights the next tube. Other numbers are added by sending im- 
pulses thru repeatedly. Impulses may be sent at the rate of 100,000 per 
second. The numbers handled are of ten decimal places. Additions are made
at 5000 per second, and multiplications at 350 per second. Numbers go 
in by direct setting of switches or by feeding cards. Results come out 
punched on cards. Instructions are put in by plugging in and connecting 
tubes—a slow and tedious process. Once plugged in the ENIAC works 
so fast that it can compute the trajectory of a projectile faster than the 
shell itself can travel thru the air (13). 
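
The counting scheme described above, in which one stage of a row of ten is “on” and each impulse moves the “on” state to the next stage, can be sketched as a simplified model. The carry arrangement between decades is an assumption made for illustration and is not described in the sources cited; the complementary second row of tubes is omitted, and all names are hypothetical.

```python
class DecadeCounter:
    """One decimal digit held as a ring of ten on/off stages, exactly one of
    which is 'on' at any time (a simplified model of the description above)."""

    def __init__(self, digit=0):
        self.stages = [i == digit for i in range(10)]

    @property
    def digit(self):
        return self.stages.index(True)

    def pulse(self):
        """Advance by one; return True when the ring passes from 9 back to 0."""
        current = self.digit
        self.stages[current] = False
        self.stages[(current + 1) % 10] = True
        return current == 9

def add_by_pulses(counters, decade, amount):
    """Add `amount` (0-9) to the given decade by repeated pulses, passing
    carries to higher decades. counters[0] is the units decade."""
    for _ in range(amount):
        position = decade
        while position < len(counters) and counters[position].pulse():
            position += 1   # carry into the next decade

def value(counters):
    return sum(c.digit * 10 ** i for i, c in enumerate(counters))

register = [DecadeCounter() for _ in range(4)]
for decade, digit in enumerate([7, 9, 1]):   # load 197
    add_by_pulses(register, decade, digit)
add_by_pulses(register, 0, 5)                # 197 + 5
print(value(register))
```

A carry out of the highest decade is simply dropped in this model.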

In the Bell Telephone Laboratories Relay Calculator (2, 13, 21, 121, 122) 
two-position relays are used to represent numbers. Five open and two 
closed relays represent each digit. Decimal numbers are put into the 
machine in binary form. The machine is extremely flexible and dependable, 
and complicated routines are readily handled. Besides handling the four 
fundamental processes, it can consult any of six tables of 1000 numbers 
each; it can remember thirty numbers of seven digits each and their 
decimal points; it can refer to any of twenty-five routines; it can record 
results on paper tape and later in the same problem use them in further 
calculations. Altho this is the slowest of the machines discussed, it is 
estimated that a single machine can do the work of about 100 human 
computers. 
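
A seven-relay code in which exactly two relays are closed is commonly called a bi-quinary code. The sketch below assumes that arrangement (two relays weighted 0 and 5, five relays weighted 0 to 4), since the text gives only the numbers of open and closed relays. The function names are hypothetical.

```python
BI_WEIGHTS = (0, 5)
QUINARY_WEIGHTS = (0, 1, 2, 3, 4)


def encode_digit(d):
    """Return 7 relay states (True = closed) for a decimal digit 0-9."""
    if not 0 <= d <= 9:
        raise ValueError("digit must be 0..9")
    bi = [w == 5 * (d // 5) for w in BI_WEIGHTS]
    quinary = [w == d % 5 for w in QUINARY_WEIGHTS]
    return bi + quinary


def decode_digit(relays):
    """Decode 7 relay states, rejecting any pattern that is not valid."""
    bi, quinary = relays[:2], relays[2:]
    if len(relays) != 7 or sum(bi) != 1 or sum(quinary) != 1:
        raise ValueError("invalid pattern: expected one closed relay per group")
    return BI_WEIGHTS[bi.index(True)] + QUINARY_WEIGHTS[quinary.index(True)]


for digit in range(10):
    print(digit, ["X" if r else "." for r in encode_digit(digit)])
```

Because every valid digit closes exactly one relay in each group, a relay that sticks or fails produces a pattern the decoder rejects, which is one plausible source of the dependability noted above.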

In order to combine the flexibility and accuracy of the Bell Relay Com- 
puter and the speed of the ENIAC, the EDVAC was developed (4). Only 
sketchy information has been published up to this time on the EDVAC. 

Womersley (142) and Berkeley (13) have emphasized that in order to 
make efficient use of these machines a new type of thinking is needed. 
Formulas and procedures selected for the ease with which they could be 
applied to pencil-paper computing are no longer adequate. Suggestions 
of such changes appear in the Harvard Manual (75). An interesting devel- 
opment in this connection is the use of “flow diagrams” described by Gold- 
stine and Von Neumann (57). These “flow diagrams” emphasize the 
logical aspects of the problem and subordinate the purely arithmetical. 
After completion of a “flow diagram,” the actual coding is fairly easy, 
tho tedious. 

Comrie (30) and Hartley (64) have recognized that these machines are 
extraordinary devices, but have also observed that their cost and compli- 
cated routine of coding data and instructions make them useful only for 
problems involving huge masses of data and intricate routines of computa- 
tion. For most practical purposes the burden of computation must be 
carried by conventional mechanical devices as it was during the war. 


Mechanical Computers 


Punched card equipment, mechanical accounting machines, and the 
various kinds of desk calculators comprise the mechanical calculators. 
Bibliographies may be found in Murray (108), George (55), and in the 
International Business Machines Corporation Bibliography (81). 

Most of the applications reported in the literature surveyed were con- 
cerned with punched card equipment. Especially prominent was the large 
number of applications in engineering and applied sciences probably forced 
upon workers in these areas by wartime demands. Herget (72) used a 
tabulator and multiplying punch to describe the equation of motion of an 
n-body problem. Bergman (11, 12) used punched card equipment to 
solve differential equations; Munk (107), Leppert (96), Kormes and 
Kormes (89) indicated aeronautical applications; Cox, Gross, and Jeffrey 
(33) illustrated applications in crystallographic analysis; Eckert (45) and 
Frear (52) described applications in chemistry. 

The printing of mathematical tables was further discussed by Eckert and 
Haupt (46). Laderman and Abramowitz (93) described the use of 
tabulating equipment in differencing tables, and then pointed out that in the 
U. S. Hydrographic Office Mathematical Tables Project the Sundstrand 
Accounting Machine, Model D, was used. It efficiently checked the accuracy 
of tables by differencing and it also could reverse the operation to construct 
tables from differences. Reynolds (114) used a prepunched master deck 
in constructing tables which involved extracting numerous square roots. 
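
Checking a table by differencing, and the reverse operation of building a table from its differences, can be sketched as follows. The function names are hypothetical and the table of cubes is chosen only for illustration; no claim is made that the machine procedure took this form.

```python
def differences(values):
    """First differences of a list of tabulated values."""
    return [b - a for a, b in zip(values, values[1:])]


def difference_table(values, order):
    """Return [values, 1st differences, ..., order-th differences]."""
    rows = [list(values)]
    for _ in range(order):
        rows.append(differences(rows[-1]))
    return rows


def rebuild_from_differences(leading_entries, last_row):
    """Reverse the operation: recover the table by repeated summation.

    leading_entries holds the first entry of each row from the table itself
    up to (but not including) the highest-order row; last_row is that
    highest-order row in full.
    """
    row = list(last_row)
    for first in reversed(leading_entries):
        rebuilt = [first]
        for d in row:
            rebuilt.append(rebuilt[-1] + d)
        row = rebuilt
    return row


cubes = [n ** 3 for n in range(10)]
table = difference_table(cubes, 3)
print(table[3])                      # third differences of a cubic are constant

bad = list(cubes)
bad[5] += 1                          # a single misprinted entry
print(difference_table(bad, 3)[3])   # the error disturbs four neighboring differences
```

A single wrong entry spreads into the higher differences with alternating binomial weights, so an irregular run in an otherwise smooth column of differences points to the faulty line of the table.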

Grosch (60) described a method of harmonic analysis by use of progres- 
sive digiting. Alt (3) delineated the method developed at the Aberdeen 
Ballistic Research Laboratory for multiplying matrices by using punched 
cards. He also described the set-up for finding the inverse of a matrix by 
successive approximations. Kimball (87, 88) listed some punched card 
computational methods used by the Census Bureau, and Kempthorne 
(85, 86) indicated the uses of a punched card system for survey data. 

Advantages and disadvantages of multiple punching were summarized by 
Benjamin (10) who also suggested efficient means of coding and wiring 
to extract information. Taylor (127) described the use of an alphabetic 
punch in conjunction with coded step intervals to increase the amount of 
data punched on a card. He suggested the resolution of two difficulties, 
(a) separating the zone from the numerical punches, and (b) distinguish- 
ing between the X and the Y punch in tabulating. Bartlett (9) reported a 
process for listing scattergrams. 

Ellis and Riopelle (47) used a sorter, alphabetic tabulator, a collator 
with a card counting device, and a summary gang punch to compute 
higher moments as well as sums, sums of squares, and crossproducts. 
Dwyer (42) reported formulas and procedure for correlation coefficient 
summation when there were missing variates. To eliminate information 
in other fields corresponding to the missing variate, the method used was 
an elimination field and X distributors when the missing information was 
punched X, or digit selectors when the missing information was not 
punched. 

Tucker (134) described in detail a simplified punched card method in 
factor analysis which required only a limited amount of equipment, i.e., a 
key punch, an alphabetic accounting machine with either complete pro- 
gressive totals or a summary punch, and counters that both add and sub- 
tract. An illustrative example was worked out, including complete operating 
instructions and machine set-up. The iterative trial procedure rather than 
the sign change method was used. 

Mosier (105) described the use of IBM accounting equipment in carry- 
ing out an iterative procedure for arriving at weights for a set of question- 
naire responses. The problem involved scaling 100 items on a housing 
inventory so that weights assigned would yield the most reliable over-all 
index. Weights were assigned arbitrarily and then adjusted by successive 
iterations. Three iterations were sufficient in the example reported. Kurtz 
(92) used data from life insurance agents’ rating charts punched on IBM 
cards to determine how to score the various items of information to yield 
the maximum correlation with sales and persistence. Other useful applica- 
tions of punch card equipment are reported by Cochran (28), McQuitty 
(101), Dyer (44), and Homeyer, Clem, and Federer (74). 


December 1948 COMPUTATIONAL TECHNICS 





Berkeley (13) indicated how written information might be put into 
punch card form, and Gull (61) illustrated a punched card system for 
bibliography and indexing of chemical literature. Black and Olds (17) 
described how detailed tables of census data might be made available 
to users at a minimum cost. 

The use of punched cards was also discussed by the International Busi- 
ness Machines Corporation publications (77, 78, 79, 80). 

Calculations made directly from test scoring machines were discussed by 
Froehlich and Keller (53) and by Herfindahl (71). Epstein (48) reported 
on statistical analysis with hand punched and sorted cards. While these 
are interesting applications, the use of punched cards in handling the 
same problems seems to be more efficient and far more flexible. 

Various kinds of mechanical calculating machines were also discussed. 
The construction of calculating machines was comprehensively treated by 
Murray (108). He considered digital machines and the component parts 
of desk calculators, continuous operators, composite analogue devices such 
as differential analyzers, network analyzers, linear equation solvers, and 
mathematical instruments such as planimeters, harmonic analyzers, and 
the like. The book, however, contained nothing on large-scale discrete 
variable calculators. Berry and Pemberton (15) described a twelve- 
equation computing instrument, Bleik (18) reported on a machine for 
solving quadratic and cubic equations, and Zuse (146) designed a rapid 
but in several respects impractical computer. 

Sadler (115) and Comrie (30) expressed the belief that full exploita- 
tion of the capabilities of the commercial calculating machines was usually 
the most efficient way of dealing with ordinary problems. The coupling 
of two or more machines together, with automatic transfer of results from 
one to the other, produces a considerable increase of efficiency. Comrie 
discounted the customary value assigned to electric over hand-operated 
machines since coupling and transfer could be adapted more readily to 
the hand machines. 


Tables 


In a thoro and painstaking manner Fletcher, Miller, and Rosenhead (51) 
summarized the mathematical tables published before 1944. Part I of the 
book consists of an index of tables arranged according to function tabu- 
lated. Under each table listed there is noted the number of decimals and 
figures, the interval and range of the argument, the facilities for interpo- 
lation, and the authorship and date. Known corrections, if brief, are then 
listed, otherwise reference is made to the bibliography of Part II, where 
sources containing the detailed corrections may be located. Tables which 
met the authors’ unspecified standards of excellence are listed in bold 
type. Several tables of the normal probability integral, the t and the F 
distributions are so listed, but no table of Chi-square appears in bold type. 
Part II contains a bibliography of some 2000 items, which refer to the 
tables indexed in Part I, and to books on probability and statistics 
including applications to education and psychology. Tables developed since 1944 
are summarized in Mathematical Tables and Other Aids to Computation. 

In the literature reviewed there appeared a number of tables of interest 
to computers in educational research. For interpreting tests of significance 
several tables may be referred to. The extension of the F distribution 
table by Merrington and Thompson (103) gives entries to five-decimal 
places and probability levels extending from 0.005 to 0.50. With reference 
to the t distribution, Baldwin (8) observed that the use of normal deviates 
when the degrees of freedom exceeded thirty gave results smaller than 
the true values. To provide accurate t values she extended the table to 100 
degrees of freedom. Thompson and Merrington (130) reported tables 
for testing the homogeneity of a set of variances which are better approxi- 
mations than the values based on Bartlett’s test. They discussed a common 
misconception, namely that if the probability of heterogeneity is less 
than 0.05 or 0.01, the sample variances are treated as tho they were 
estimates of a common variance. In their judgment this procedure is likely 
to lead to errors of the second kind in certain instances, but these instances 
are not specified because that problem has not been investigated mathe- 
matically. Pearson and Hartley (111) presented tables for finding the 
probability that the range of sample A exceeds that of B by a certain ratio, 
and for finding the limits of the range corresponding to prescribed proba- 
bility levels. Hartley (65) similarly considered the use of the mean devia- 
tion and tabulated its integral. Baker (7) studied the distribution of the 
ratio of sample range to standard deviation for normal and combinations 
of normal distributions. Tables were presented for making tests of signifi- 
cance. A simple test of significance based upon signs of differences between 
pairs of observations was devised by Dixon and Mood (38). Illustrative 
examples and a table of significance levels were included. This appears 
to be a useful device for quick appraisals but the efficiency is probably 
low. Festinger (49) converted scores to rank order and then tested signifi- 
cance between means without reference to frequency distributions. Tables 
of the 0.05 and 0.01 levels were given. He argued that what the test lost 
in precision by conversion to rank order was compensated for by the gain 
in generality since the test could then be used with any distribution. 
Wilcoxon (141) discussed individual comparisons by ranking methods. 
Grant (59) considered some of the recent work on probability of “runs” 
and developed a table and a criterion for testing the significance of re- 
sponses in learning and problem-solving. Taylor (128) reported tables 
for determining the significance of skewness and of differences in skewness 
when expressed in terms of Fisher’s g statistics. 
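
A test based on the signs of paired differences, of the kind attributed above to Dixon and Mood (38), can be sketched with an exact binomial calculation in place of a printed table. The function name is hypothetical, and dropping tied pairs is an assumption reflecting common practice rather than anything stated in the text.

```python
from math import comb


def sign_test(pairs):
    """Two-sided sign test on paired observations.

    Returns (n_plus, n_minus, p_value). Pairs with zero difference are
    dropped, which is a common convention and an assumption here.
    """
    diffs = [a - b for a, b in pairs if a != b]
    n = len(diffs)
    n_plus = sum(d > 0 for d in diffs)
    n_minus = n - n_plus
    k = min(n_plus, n_minus)
    # Probability of k or fewer of the rarer sign when each sign has
    # probability 1/2, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return n_plus, n_minus, min(1.0, 2 * tail)


# Nine pairs favor the first method, one favors the second (made-up scores).
scores = [(i + 1, i) for i in range(9)] + [(0, 1)]
print(sign_test(scores))
```

Only the signs enter the calculation, so the sizes of the differences are ignored. This is the reason the test is quick to apply and also the reason its efficiency is expected to be low.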

Other tables of specialized application are those by Swineford (123, 
124), Jurgensen (84), Taylor and Gaylord (129), Lichte (99), Davis 
(37), Lehmer (95), Croxton and Cowden (35), and Leverett (97). 

Carter (25, 26) contrasted the effectiveness of tabular and graphical 
presentation. 


Graphs 


Graphical aids noted in the literature reviewed included computing charts 
or abacs, alignment charts or nomographs, slide rules, and other 
adaptations of graphs to computation. The construction of computing charts, 
or abacs, was described by Peterson and Gulliksen (113). Crow (34) developed a 
chart of Chi-square and t which facilitates interpolation for routine work. 
Schutzenberger’s (117) abac of the sample range should enable the com- 
puter to apply tests of significance more rapidly but less rigorously than 
by using the Pearson and Hartley tables (111). 

The construction of alignment charts is described by Bond (19), 
Douglass (39), and Young (143). Hamilton (63) presented a nomograph 
for the tetrachoric correlation coefficient. Specialized graphical calcula- 
tion of statistical problems was illustrated by Levi (98), Dufrenoy and 
Goyan (40), Goyan (58), and Hayes (70). 

Some interesting slide rules were developed. Merrill (102) invented a 
slide-disk calculator for computing root mean squares which bears a strik- 
ing resemblance to the cylindrical slide rule. The most interesting and 
potentially useful slide rule for statistics is the film slide rule described by 
Stibitz (120). Each scale is printed on separate 35 mm. film about 220 feet 
long. Accuracy is obtained by using the teeth of the sprocket as the unit of 
measurement rather than the printed scale. The film simply counts the 
sprocket teeth that pass under a fixed mark, and the scale measures frac- 
tional parts of the distances between sprocket teeth. The rule has been made 
in sizes from three to ten films on each rule with appropriate mechanical 
connections between sprockets. It has been found to save 80 to 90 percent of 
the time required for computing over the use of tables and desk calculators. 

Callender (24) showed how a simple differential equation could be solved 
rapidly by using a hatchet planimeter. The ease of constructing such a 
device may stimulate thinking relative to its application to the integration 
of empirical curves. 

Fiske and Dunlap (50) described a graphical test for significance of 
differences between frequencies of different samples. Zimmerman (145) 
reported on apparatus for making orthogonal rotations by projecting co- 
ordinates from one plot to another. 


Computational Methods and Formulas 


Variations of computational procedure involved derivation of new for- 
mulas, improvements in matrix calculation and error determination, 
changes in methods of computing, and the computation of some new 
statistics. 

Guttman (62) described a method for inverting any nonsingular matrix 
by building the inverse out of inverses of successively larger submatrices. 
This is a variant of the contribution attributed to Schur. Satterthwaite 
(116) demonstrated that the solution of a large set of simultaneous 
equations and the inversion of matrices were complicated by errors due to 
rounding. If the norm of the matrix was less than 0.35, operations in- 
volving the inversion were in a state of error control for Doolittle calcula- 
tion. 

Berry (14) showed that the order in which the elements of the matrix 
are arranged is important. An arrangement in which the diagonal terms 
are large and the off-diagonal, especially the post-diagonal terms are small, 
favors convergence for the iterative method. Bruner (20) came to essen- 
tially the same conclusion empirically, and indicated that in the Doolittle 
solution the check column gave closer agreement if the equations were 
arranged so that the elements of the principal diagonal increased in going 
from upper left to lower right. Leavens (94) considered the same problem. 
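
The effect of arrangement on an iterative solution can be seen in a two-equation example. The sketch assumes the Gauss-Seidel procedure as the iterative method, which the text does not specify, and the coefficients are invented for illustration.

```python
def gauss_seidel(a, b, sweeps=50, tol=1e-10):
    """Solve a x = b iteratively.

    Returns (x, sweeps_used), or (x, None) if the process fails to settle.
    """
    n = len(b)
    x = [0.0] * n
    for sweep in range(1, sweeps + 1):
        largest_change = 0.0
        for i in range(n):
            s = sum(a[i][j] * x[j] for j in range(n) if j != i)
            new = (b[i] - s) / a[i][i]
            largest_change = max(largest_change, abs(new - x[i]))
            x[i] = new
        if largest_change < tol:
            return x, sweep
        if largest_change > 1e12:   # corrections are growing: give up
            break
    return x, None


# The same two equations written in two orders (illustrative numbers).
well_ordered = ([[10.0, 1.0], [2.0, 8.0]], [12.0, 18.0])    # large diagonal
badly_ordered = ([[2.0, 8.0], [10.0, 1.0]], [18.0, 12.0])   # small diagonal

print(gauss_seidel(*well_ordered))
print(gauss_seidel(*badly_ordered))
```

With the large coefficients on the diagonal the corrections shrink rapidly at each sweep; with the rows interchanged the same equations give corrections that grow without limit.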

Waugh (138) gave a simple illustration of Hotelling’s method of invert- 
ing a partitioned matrix by partitioning a square matrix of 2p rows into 
four square matrices. The inverse was also written as a partitioned matrix. 
Multiplying the original by its inverse gave four matrix equations which 
were solved for the four elements. Waugh (137) also presented a formula 
for computing partial correlation coefficients which is new and should 
save computational effort. Jenkins (82, 83) considered a systematic arrange- 
ment of computation for multiple and partial correlation. Kossack (91) 
presented a model to be followed in computing many zero-order correlation 
coefficients from a correlation matrix. Weichelt (140) considered a method 
of estimating correlation coefficients by expressing r as the ratio between 
two differences in sums of the dependent variable computed only for ex- 
tremes of the bivariate distribution. Waugh and Dwyer (139) illustrated 
compact efficient computation of the inverse of a matrix. Dwyer (43) 
described a square root or compact method for computing correlation 
and regression. (By compact he means that the operations are so designed 
that the machine used carries out many computational steps as a single 
machine operation.) This is an approximation method and subject to the 
errors discussed by Satterthwaite (116). 
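
A square root method for regression can be sketched on the assumption that it amounts to factoring the symmetric matrix of intercorrelations into a triangular matrix and its transpose, followed by forward and back substitution. The function names and the correlations are hypothetical.

```python
from math import sqrt


def square_root_factor(a):
    """Lower-triangular L with L L^T = a, for symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def solve(a, b):
    """Solve a x = b by forward then back substitution on the factor."""
    n = len(b)
    low = square_root_factor(a)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


# Intercorrelations of two predictors and their correlations with a criterion
# (made-up values); the solution gives standardized regression weights.
r_xx = [[1.0, 0.5], [0.5, 1.0]]
r_xy = [0.6, 0.5]
betas = solve(r_xx, r_xy)
print(betas)
```

Each element of the factor is an accumulated sum of products followed by one division or one square root, which is the kind of step a desk calculator carries out as a single operation.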

Norton (110) considered calculation of Chi-square for complex con- 
tingency tables. He presented a scheme of successive approximations which 
made the computation systematic. The method provided comprehensive 
analysis of the contingency table since the interactions as well as the main 
effect were studied. 

Voss (136) described a short-cut for comparing the effectiveness of two 
methods where a number of experimental conditions were involved. The 
frequencies of differences between pairs on each condition were tabulated 
and Chi-square was computed to determine significance of the distribution 
of differences. 

Cowden (32) gave a simple illustration of the application of sequential 
sampling to an educational problem. In sequential sampling the items are 
tested one at a time and a decision is made as soon as enough data have 
been accumulated to justify the decision. One does not know in advance 
how many items will be needed. The illustration concerned a true-false 
examination given to decide the student’s grade. Good and poor students 
could be discriminated quickly, so that the illustration was concerned 
with borderline discriminations. The goal was to reduce the number of 
questions asked of the student to a minimum and at the same time control 
the probability of passing a poor student or failing a good one. 
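
The sequential procedure can be sketched on the assumption that it follows Wald's sequential probability ratio test. The proportions of correct answers taken to define a poor and a good student (`p_poor`, `p_good`) and the two risks (`alpha`, `beta`) are hypothetical values chosen for the example.

```python
from math import log


def sequential_grade(answers, p_poor=0.6, p_good=0.8, alpha=0.05, beta=0.05):
    """Sequential probability ratio test on a stream of right/wrong answers.

    answers: iterable of booleans (True = correct).
    alpha: accepted risk of passing a poor student.
    beta: accepted risk of failing a good student.
    Returns (decision, questions_used); decision is 'pass', 'fail', or
    'undecided' if the answers run out first.
    """
    upper = log((1 - beta) / alpha)    # reaching this boundary means 'pass'
    lower = log(beta / (1 - alpha))    # reaching this boundary means 'fail'
    right = log(p_good / p_poor)
    wrong = log((1 - p_good) / (1 - p_poor))
    llr = 0.0
    used = 0
    for correct in answers:
        used += 1
        llr += right if correct else wrong
        if llr >= upper:
            return "pass", used
        if llr <= lower:
            return "fail", used
    return "undecided", used


print(sequential_grade([True] * 30))
print(sequential_grade([False] * 30))
```

Clear cases cross a boundary after a few questions, whereas a student who answers near the middle of the two proportions keeps the sum between the boundaries and so requires many more questions.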

By use of “systematic statistics,” Mosteller (106) proposed an analysis 
for large masses of data where collecting the data was inexpensive 
compared to the cost of analysis by efficient procedures. Most 
of the work could be done with a counting sorter. Procedures were given 
for estimating the mean, standard deviation, and correlation coefficient, 
and the efficiency of each estimate was discussed. 

Horton (76) indicated how large sets of random numbers might be 
obtained thru compound randomization by using a binary rather than a 
decimal system of numbers. A scheme for reducing symbol bias in shifting 
back to decimal numbers was discussed. It was proposed that electronic or 
electrical systems actuated by cosmic rays, together with tabulating 
equipment, would be feasible means of turning out large amounts of 
random numbers. 
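
One way of combining binary sources and returning to decimal digits without favoring any digit is sketched below. It is not offered as Horton's scheme: combining two streams by addition modulo 2 and discarding four-bit groups above nine are both assumptions, and a software generator stands in for the physical source.

```python
import random


def compound_bits(source_a, source_b):
    """Combine two bit streams by addition modulo 2 (exclusive or)."""
    for a, b in zip(source_a, source_b):
        yield a ^ b


def decimal_digits(bits):
    """Turn a bit stream into decimal digits without favoring any digit.

    Bits are read four at a time; values 10-15 are thrown away rather than
    folded back onto 0-9, which would over-represent some digits.
    """
    group = []
    for bit in bits:
        group.append(bit)
        if len(group) == 4:
            value = group[0] * 8 + group[1] * 4 + group[2] * 2 + group[3]
            group = []
            if value < 10:
                yield value


def biased_bits(p_one, count, rng):
    """Stand-in for a physical source whose bits are 1 with probability p_one."""
    return (1 if rng.random() < p_one else 0 for _ in range(count))


rng = random.Random(1948)
bits = compound_bits(biased_bits(0.55, 4000, rng), biased_bits(0.55, 4000, rng))
digits = list(decimal_digits(bits))
print(len(digits), digits[:20])
```

If each source gives a one with probability 1/2 + e, the combined stream gives a one with probability 1/2 - 2e², so a bias of 0.05 in each source falls to 0.005 after one compounding.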


Specific computational developments relating to factor analysis and 
other statistical technics have been discussed in connection with the com- 
puting devices rather than the technics. However, the books by Thomson 
(131) and Thurstone (132) deserve special mention. 


INTRODUCTION 


The preparation of this special number regarding psychological research 
in or for the Armed Forces was initiated by Alvin C. Eurich, as president 
of the American Educational Research Association, and the executive com- 
mittee of the association shortly after the end of World War II. A com- 
mittee composed of individuals representing several of the various groups 
of psychologists working on military problems was appointed. The com- 
mittee agreed that such a review and bibliography could be especially 
valuable if it were comprehensive and directed attention to the wealth of 
materials which had not been made generally available and to a large 
extent probably never would be published in the professional journals. 
It was also believed that this material could be best reviewed by those 
who actually participated in the research and were familiar with the 
essential background and conditions. 

Assignments for various chapters and sections were made by the com- 
mittee and a tentative schedule established. Unfortunately, the pressure 
of preparing official reports and the extensive personnel shifts during the 
period immediately after the war necessitated numerous changes both in 
scheduling and in the responsibilities for reviewing particular materials. 
In most cases the complete reports were only available in the official files 
and the committee was therefore dependent on obtaining the cooperation 
of a small number of individuals who were attempting to carry on the 
research work initiated during the war. Under these circumstances the 
assistance of the various collaborators is especially appreciated. The appre- 
ciation of the problems and the valuable assistance of the executive com- 
mittee have been important factors in the completion of this project. 


JOHN C. FLANAGAN, Chairman 


CHAPTER I 





General Reports of Research Programs 
for the Armed Forces 


JOHN C. FLANAGAN 


DURING the period of World War II a large amount of research on 
psychological and educational problems was conducted in and for the various 
armed services. Because of military security measures and the pressure of 
current military duties and problems only a small fraction of this research 
was published in professional journals or otherwise made generally avail- 
able during the war. The purpose of this review is to bring to the attention 
of research workers the nature and scope of the research studies conducted 
so that the experience and findings of the wartime studies may be sum- 
marized in a single source. 

One of the first groups to become actively engaged in military research 
in the period preceding the entry of this country into the war was the 
committee of the National Research Council established at the request of 
Dean R. Brimhall, Director of Research, Civil Aeronautics Administration. 
The research in aviation psychology of this group has been reported in 
a series of Research Reports published by the Civil Aeronautics Adminis- 
tration. The findings of this series of studies have been reviewed by Viteles 
(5) and are not included in the present survey. This group, of which M. S. 
Viteles is chairman, is continuing an active program of research. It is 
now known as the Committee on Aviation Psychology of the National 
Research Council. 

A number of those most active in the early stages of the program dis- 
cussed in the preceding paragraph entered the Navy after our entry into 
the war and an Aviation Psychology Branch was established under the 
direction of the late John G. Jenkins in the Bureau of Medicine and Surgery 
in the Navy Department in Washington. The reports of this group are 
reviewed by Ames and Older in Chapter II of this survey. This work is 
continuing in the same location under the direction of Lieutenant Harry 
J. Older. 

The research program in aviation psychology reviewed by Frederick B. 
Davis in Chapter III was initiated in the summer of 1941. The research 
results of this group have been reported in a series of nineteen research 
volumes under the general title of Army Air Forces Aviation Psychology 
Program Research Reports (3). The scope of this program has been 
expanded and it is continuing under the direction of Glen Finch, Acting 
Research Chief, Division of Human Resources, Office of Research and 
Development, Headquarters, United States Air Force. 

REVIEW OF EDUCATIONAL RESEARCH Vol. XVIII, No. 6 

The Adjutant General’s Office in the War Department established a 
Personnel Procedures Section under the technical direction of Walter V. 
Bingham in the fall of 1940. The numerous research studies carried out 
by this group have had very little circulation outside of the staff of this 
group. Therefore the review of these studies by Sisson in Chapter IV 
should be especially valuable in bringing to the reader’s attention work 
done under the supervision of Marion W. Richardson, Edwin R. Henry, 
and others who directed this work. This work is continuing under the 
direction of Donald E. Baier. 

The program on personnel research and test development in the Bureau 
of Naval Personnel did not get started until late in the fall of 1942. The 
organization was directed by Alvin C. Eurich initially. He was succeeded 
by Raymond Faulkner. The work is being continued under the direction 
of Eugene D. Carstater. The work of this group has been reported in a 
volume (4) edited by Dewey B. Stuit. It was originally planned that this 
material be reviewed by one of the group who worked in the program. 
This proved to be impossible. The reviews of the published reports of this 
group are therefore included in the miscellaneous chapter. 

Thruout the war a substantial amount of research on personnel problems 
was conducted for the Navy by the National Defense Research Committee 
thru its Applied Psychology Panel. John M. Stalnaker was chairman of 
the original committee set up to handle this work. He was succeeded by 
Walter S. Hunter when the panel was formed. This group contracted 
with various universities and other organizations to carry out specific 
research and development projects requested by the armed services. The 
reports of these groups are listed in a bibliography prepared by Bray (1). 
An official summary report has also been published in two volumes (7). 
One is on aptitude and classification, the other on training and equipment. 
Both are edited by Wolfle. Another more popularly written account of the 
work of these groups has been prepared by Bray (2). Plans for the review 
of this work by personnel participating in the program could not be carried 
out and the published reports of this group have also been included in 
the reviews of the miscellaneous materials. 

In addition to the work done under the supervision of the Applied Psy- 
chology Panel there was a substantial amount of research done by civilian 
organizations under other auspices. One of the largest of such programs 
was the work of the Psycho-Acoustic Laboratory at Harvard University 
under the direction of S. Smith Stevens. A review of the work of this group 
is given in Chapter VI. 

Another program including a number of psychologists was the service 
work in assessing candidates for assignments for the Office of Strategic 
Services. This has been reported in a recently published volume prepared 
by a group of staff members (6). 

One additional set of reports on psychological work done in the services 
during the war is to be published. This is an account of the work of the 
Morale Services Division. This work was initiated by Major General Fred- 
eric Osborn and was carried out under the immediate supervision of 
Samuel S. Stouffer and Carl I. Hovland. The four volumes reporting the 
findings of this group are expected to be available soon. 

A number of psychologists rendered valuable services in many other 
connections during the war. Published reports of many of the studies 
done under their direction are briefly reviewed in Chapter V. It is believed 
that a small number of important research studies carried out for the 
services during World War II have been overlooked. However it is hoped 
that thru the many reports listed in this review, research workers will be 
able to benefit from most of the valuable studies carried out during this 
period. 


CHAPTER II 


Aviation Psychology in the United States Navy 


VIOLA CAPREZ AMES and HARRY J. OLDER 


The INVESTIGATIONS reported in this chapter have been selected as repre- 
sentative of both the type and the scope of the work developed by the 
psychologists in the Aviation Psychology Branch, Bureau of Medicine and 
Surgery, Navy Department, under the direction of Captain John G. Jenkins. 
Much of the work of the Branch was of an advisory or applied nature 
which did not lend itself to written reports. Consequently, many aspects 
of the program are not in written form.* 

The development of the naval aviation psychology program up to and 
following the time of the establishment of the central office in October 
1942 may be read in several descriptive summaries (7, 8, 14, 15, 16, 
32, 40). Psychologists were originally commissioned to administer, score, 
and interpret tests for the selection of naval aviation cadets; however, the 
program soon broadened to include the development of experimental 
designs for research projects, statistical analyses, methods for selecting 
flight instructors and aircraft gunners, investigation of attrition, develop- 
ment of training aids, advisory aid to other bureaus, and research on vision 
and communication. 

The principal research groups in the naval aviation program were at 
Washington, D. C.; Pensacola, Florida; Corpus Christi, Texas; and Jack- 
sonville, Florida. The Washington group was primarily occupied with the 
administration of the program, the validation of the tests, the development 
of improved criteria, and consulting services. At Pensacola emphasis was 
on the investigation of problems of night vision training, disorientation, 
and intelligibility. Studies on fear and leadership were conducted at 
Corpus Christi. The Aviation Gunnery Group worked on the development 
of uniform curriculums for gunnery schools, improved grading systems, 
and tested special devices (7, 8, 27). 


Selection and Classification 


About a year and one half before the Pearl Harbor attack, work had begun 
on the validation of a group of tests for the selection of naval aviators. 
From the forty different tests investigated, three were selected. Each of 
these three tests was validated on groups of over 3000 cadets (44). 

The three tests originally used were the Wonderlic Personnel Test (PT), 
the Mechanical Comprehension Test (MCT), and the Biographical 
Inventory. In October 1942 the Wonderlic Personnel Test was replaced by the 
Aviation Classification Test (ACT). Two forms of this test were developed 
by the members of the Aviation Psychology Branch in such a manner as 
to give maximal spread and maximal reliability in the region of the cutting 
score (31, 23). Both forms had estimated odd-even reliabilities of over .92. 

* The statements contained herein are the personal interpretations of the writers and are not to be 
construed as reflecting the views of the Navy Department or the naval service at large. 

Early investigations indicated the ability of the Wonderlic Personnel 
Test to discriminate between trainees who pass or fail (in aviation train- 
ing) among low-score groups. However, this was not true for the middle 
and upper score-range groups. The Personnel Test was found to be most 
valuable for predicting ground-school failures (6). Like the Personnel Test, 
the Aviation Classification Test was found to predict academic failures 
(ground-school training) fairly well, but to be of no value in predicting 
flight-training failures. Biserial correlations of .29 and .38 are reported 
for the Aviation Classification Test based on all entrants into training 
versus ground-school training failures. 
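
The biserial coefficients cited in this section relate a continuous test 
score to a pass-fail outcome. As an illustration only, the following Python 
sketch computes the coefficient from the textbook formula 
r_b = ((M_p - M_f) / sigma) * (p * q / y), where y is the ordinate of the unit 
normal curve at the point of dichotomy. The function name and the sample 
scores are hypothetical; the reports cited do not state which computing 
routine was used. 

```python
from statistics import NormalDist, mean, pstdev


def biserial(scores, passed):
    """Biserial r between a continuous score and a pass/fail outcome."""
    pass_scores = [s for s, ok in zip(scores, passed) if ok]
    fail_scores = [s for s, ok in zip(scores, passed) if not ok]
    if not pass_scores or not fail_scores:
        raise ValueError("both a pass group and a fail group are required")
    p = len(pass_scores) / len(scores)   # proportion passing
    q = 1.0 - p                          # proportion failing
    sigma = pstdev(scores)               # SD of the whole group
    unit_normal = NormalDist()
    # Ordinate of the unit normal curve at the point dividing it into p and q.
    y = unit_normal.pdf(unit_normal.inv_cdf(p))
    return (mean(pass_scores) - mean(fail_scores)) / sigma * (p * q / y)


# Hypothetical scores for eight trainees (not data from the reports).
scores = [52, 61, 47, 70, 58, 43, 66, 55]
passed = [True, True, False, True, True, False, True, False]
print(round(biserial(scores, passed), 2))
```

The coefficient assumes a normally distributed trait underlying the 
dichotomy, and in small samples it can exceed 1.00. 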

New forms of the Mechanical Comprehension Test were developed by 
the Psychological Corporation for use in the naval aviation selection 
program. The estimated odd-even reliability was .80. The test-retest coeffi- 
cients varied from .84 to .87. That the Mechanical Comprehension Test 
predicted failures for both flight- and academic-training groups is evi- 
denced by the biserial correlations presented by Fiske (6). These range 
from .14 to .43 for flight training and from .15 to .48 for ground-school 
training. 
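
An odd-even reliability of the kind reported for these tests is commonly 
estimated by correlating scores on the odd-numbered items with scores on the 
even-numbered items and stepping the result up to full test length. The 
Python sketch below assumes the Spearman-Brown formula for the step-up; the 
source does not say which correction was applied, and the function names 
are hypothetical. 

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Step a half-test correlation up to full test length."""
    return 2.0 * r_half / (1.0 + r_half)


def odd_even_reliability(item_scores):
    """item_scores: one row per examinee, one 0/1 entry per item."""
    odd = [sum(row[0::2]) for row in item_scores]    # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]   # items 2, 4, 6, ...
    return spearman_brown(pearson(odd, even))
```

A test-retest coefficient, by contrast, is simply the correlation between 
two administrations of the whole test, so no step-up is needed. 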

The Biographical Inventory is a questionnaire with items on biographical 
information, interests, habits, and attitudes (6, 14, 38, 41). It was originally 
developed for use in the selection of civilian pilots, but was later adapted 
to naval aviation selection. 

The test-retest reliability was approximately .70 for a group of almost 
2000 men. The biserial correlations for the Biographical Inventory reported 
by Fiske (6) range from .15 to .40 for flight-training failures, .06 to .28 
for ground-school failures, and .21 to .36 for all failures. 

One of the most significant technical advances made in 1942 by the 
Aviation Psychology Branch was the introduction of a single index to 
represent various combinations of test scores. This index, called the Flight 
Aptitude Rating (FAR), combined the grades on the MCT and the BI (14). 

Originally, a table was constructed to show the percent of failures among 
men obtaining each of the possible combinations of BI and MCT scores 
(18). Cells with similar percents were grouped into one of five categories 
of progressively high failure rates. Later, the scale was divided into nine 
steps to permit finer discriminations. The biserial correlation between 
pass-fail groups and the FAR was .43. Since this value was exactly the 
same as the multiple R between pass-fail and the BI and MCT, it indicated 
that the FAR made the maximum use of differentiations provided by the 
tests (6). 
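
The construction of such an index can be sketched as follows: tabulate the 
percent of failures in each cell of the two-way table, then assign each cell 
to a category according to its failure rate. The Python below is a minimal 
illustration; the letter grades, the cut points, and the function names are 
assumptions made for the example and are not the values used by the 
Aviation Psychology Branch. 

```python
from collections import defaultdict


def failure_table(records):
    """records: iterable of (bi_grade, mct_grade, failed).

    Returns {(bi_grade, mct_grade): percent of men in that cell who failed}.
    """
    counts = defaultdict(lambda: [0, 0])   # cell -> [failures, total]
    for bi, mct, failed in records:
        counts[(bi, mct)][0] += int(failed)
        counts[(bi, mct)][1] += 1
    return {cell: 100.0 * f / n for cell, (f, n) in counts.items()}


def rating_for(percent_failed, cut_points=(10, 20, 35, 50)):
    """Map a cell's failure percent to category 1..5 (1 = lowest failure rate).

    The four cut points are hypothetical; eight cut points would give the
    later nine-step scale.
    """
    return 1 + sum(percent_failed >= c for c in cut_points)


# Hypothetical records: (BI grade, MCT grade, failed in training?)
records = ([("A", "A", False)] * 19 + [("A", "A", True)]
           + [("E", "E", True)] * 3 + [("E", "E", False)])
for cell, pct in failure_table(records).items():
    print(cell, pct, rating_for(pct))
```

Because each cell is rated by its observed failure rate, the index need not 
be a linear function of the two scores. 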

Early in the program it was found that age correlated with outcome of 
training. The younger cadets were more likely to graduate than the older 
ones. It was also evident that extent of previous flight training predicted 
outcome of training, but not as well as either the BI or MCT. As for edu- 
cation, cadets with no previous flight training and less than two years of 
college showed a significantly higher percent of failures than those with 
no previous training but at least two years of college (6). 

Success in the development of technics of selection for naval aviation 
cadets suggested the feasibility of similar technics for the selection of 
flight instructors. Technical Memorandum No. 7 (39) outlined the ap- 
proach to the project. The steps, in order, were: (a) to identify two 
groups of flight instructors representing the “tails” of the distribution of 
instructor ability; (b) to determine specific characteristics which dis- 
criminate between these extremes; (c) to develop a scoring key and check 
its validity. The tests used were: PT, MCT, BI, the Aviation Preference 
Check List, the Opinions on Flight Instruction Inventory, and the Aviation 
Experience Record. The last three tests were developed expressly for this 
study. Data were completed on 905 instructors. Five types of criteria were 
established. As a result of this study the Instructor Aptitude Rating Scale 
was devised for the selection of instructors. 
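
Steps (a) thru (c) of the approach outlined in the memorandum follow the 
familiar extreme-groups method of key construction. The Python sketch below 
illustrates that general method only; the 25 percent tails, the minimum 
difference of .50, the unit weights, and all names are hypothetical and are 
not drawn from the memorandum. 

```python
def tails(ability, fraction=0.25):
    """Step (a): return (high group, low group) from a name -> rating dict."""
    ranked = sorted(ability, key=ability.get)      # lowest rating first
    k = max(1, int(len(ranked) * fraction))
    return ranked[-k:], ranked[:k]


def endorsement_rate(answers, group, item):
    """Proportion of a group answering an item in the keyed direction (1)."""
    return sum(answers[name][item] for name in group) / len(group)


def build_key(answers, high, low, items, min_gap=0.5):
    """Step (b): keep items that separate the tails; weight them +1 or -1."""
    key = {}
    for item in items:
        gap = (endorsement_rate(answers, high, item)
               - endorsement_rate(answers, low, item))
        if abs(gap) >= min_gap:
            key[item] = 1 if gap > 0 else -1
    return key


def apply_key(key, response):
    """Step (c): score one person's answers with the finished key."""
    return sum(weight for item, weight in key.items() if response[item])
```

A key built in this way capitalizes on chance differences in the groups 
from which it was derived, so its validity is properly checked on a fresh 
sample of instructors. 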

Trumbull and Vinacke (29) reported an evaluation of the Diagnostic 
Scale for Rating Flight Instructors. The scale was composed of thirty-five 
items in terms of which a student was asked to assess the merit of his 
instructor. Thirty-four instructors from two squadrons were used in the 
trial groups. The results indicated that the five degrees along the scale 
were far from equal for all questions; several questions were unsatisfactory 
in terms of consistency of the scale, but the majority of the items were 
relevant. A revised questionnaire was developed as a result of the study. 

An analysis of flight instructor selection technics was reported by 
Trumbull and Vinacke (28). Well-defined criterion groups of “good” and 
“poor” instructors were compared. Differences between the groups on com- 
ponents of selection tests were evaluated with the conclusion that the type 
of material used in these tests was of value in selecting instructors, but a 
majority of the items did not give the best prediction for the population 
used in this study. 

The Pensacola group worked on questionnaires for selection for 
advanced training. Many different criteria were used for selection purposes. 
One of these, low pressure tolerance, was eliminated after completion of 
Research Project R7-2 on classification tests in low pressure chamber (27). 


Training 


The aviation psychologists who were attached to the Naval Air Training 
Commands were engaged in a variety of training projects. Among their 
contributions were: (a) the development and introduction of improved 
training records, forms, and procedures; (b) aid in the preparation, 
evaluation, and revision of syllabi and training manuals for both flight- 
and ground-school instruction; (c) the improvement of testing methods 
and grading procedures; (d) statistical analyses of such factors as student 
flow, causes of attrition, and comparison of records from different training 
stations (7, 8, 14, 15, 16, 27). 

Considerable work was done on the standardization of flight instructor’s 
vocabulary as one of the basic problems of naval aviation training. A 
technic was developed which permitted sound recordings of all conver- 
sations between instructor and student during an instructional flight. The 
apparatus consisted of a two-way electrical interphone which also served 
as a modulator for a light-weight high-frequency transmitter. Thru this 
device it was possible to “listen in” and make recordings on the ground 
of conversations in the air. These conversations were typed and studied 
in detail. From the results the “Patter” book for flight instruction was 
written (19). 

At Pensacola various analyses of attrition were made. Among the 
reports are: “Analysis of Attrition Trends in Aviation Cadets,” “Chrono- 
logical Analysis of Requests To Be Dropped from Training,” and “Analysis 
of Attrition—Primary Land Planes” (27). 

A second major function of the Pensacola group was the investigation 
of visual problems in naval aviation training. Studies of night vision test- 
ing instruments, new color testing devices, and night vision training pro- 
cedures were carried out. 

A preliminary report on “Loss of Visual Contrast Discrimination” in- 
cludes the following statement: “Loss of visual discrimination can be both 
predicted and measured under conditions of mild anoxia. The particular 
form of the test (Hecht) is unsatisfactory due to the large proportion of 
men failing to show the anoxia effect, or failing to comprehend the instruc- 
tions” (27). 

The autokinetic illusion was studied in the laboratory with light 
stationary, light and/or subject moving, and in night formation flights. 
Autokinesis is universally experienced by normal persons; the delay in 
onset with a single light is short. Movement, in one direction, lasts about 
ten seconds. A single spot is seen to move about half the total fixation time. 
The illusion is only slightly subject to voluntary control. Increasing the 
frame of visual reference reduces but does not readily abolish the illusion, 
and it is reduced by more adequate spatial localization of object, by rapid 
relative movement of the target, and by shifts in attention. “The Autokinetic 
Illusion and Its Significance in Night Flying,” by Graybiel and Clark (10) 
reported these findings. 

An investigation of the role of vestibular nystagmus in the visual per- 
ception of a moving target in the dark by Graybiel, Clark, MacCorquodale, 
and Hupp (12) is an extension of the above study. Six subjects reported 
their visual perceptions both during and following rotation while observing 
a moving target in the dark and in a lighted room. When a subject was 
accelerated to 15 rpm in the dark, there was a rapid displacement of the 
target in the opposite direction, altho, at the same time, as a result of 
nystagmus, the target appeared motionless. Following cessation of rotation 
to the right at 15 rpm the target appeared to move very rapidly to the 
left. Following cessation of rotation to the left, the target appeared to rush 
rapidly to the right while it was displaced to the right very slowly. 

These phenomena, which did not occur in a lighted room, can be con- 
sidered as a summation of the effects of real motion of the target, vestibular 
nystagmus, and the subject’s sensation of their own motion. These effects 
have important implications in the explanation of “vertigo” in pilots. 

An analysis was also made of the concept of aviator’s “vertigo,” based 
upon personal interviews with Naval aviators by Vinacke (48). He con- 
cluded that “the term ‘vertigo’ as used by aviators covers a wide variety 
of events occurring under many different conditions of flying. The term 
‘vertigo,’ as used by pilots, should be accepted as referring to any sensation, 
or feeling, which does not accord with observable environmental facts.” 

The oculogyral and oculogravic illusions were studied in flight using 
three subjects who observed a fixed luminous target in the dark. Observa- 
tions were made in the rear cockpit of a standard navy training plane. 
The subject gave a running account of the apparent motion and displace- 
ment of the target while the pilot maneuvered the plane thru different 
degrees of bank (3). 

Studies from the Pensacola laboratory have demonstrated several illu- 
sions of movement which may occur in flight. Three of these, the autokinetic 
illusion, the oculogyral illusion, and the oculogravic illusion, were studied 
extensively. 

Vinacke (49) reported a detailed description of the types of illusions 
reported by a large number of pilots as occurring in aircraft. The illusions 
described by the aviators were categorized into five general types: visual, 
nonvisual, conflicting sensory cues, dissociational or recognitional, and 
general emotional. 

The speech intelligibility research program at Pensacola was initiated 
in 1942. Preliminary research indicated a need for more thoro analyses of 
the factors contributing to poor intelligibility of voice communications. 
One study of 200 instructors disclosed that only 14 percent had poor phona- 
tion (loudness, pitch, quality) in normal conversation, but 80 percent 
had poor phonation under simulated flight conditions. 

The researches on speech intelligibility covered the message, the talker, 
the transmission system, and the listener. Early in the program it was 
noted that certain words have a better acoustic penetration in noise. 
A study of vocabulary used by gunners in intercommunications procedure 
revealed that some words had less than 10 percent intelligibility value. 
Another observation revealed that long words had higher intelligibility 
value than short words (45). 

Speech technics have been developed to improve aerial voice communi- 
cations. Two types of transmission systems have been studied: (a) the 
Gosport (acoustical) and (b) the Radio (electrical). Intensive studies of 
the Gosport speaking type system led to modifications which improved 
intelligibility of voice transmission. Microphones, earphones, and oxygen 
masks have been studied, also. The speech laboratory has developed various 
methods to check listening ability during flight. It found little relationship 
between audiometric examination results and listening ability in noise 
(22, 23, 45, 46). 

Steer, Lawrence, and others (24) reported an evaluation of the Gosport 
speaking tube. Flight and laboratory tests were conducted to evaluate 
the relative advantages of the old and new Gosport. An experimental feed- 
back system which allowed the instructor to hear himself talk to the student 
was also tested. 

The speech intelligibility training program was described in detail by 
Steer and Hadley (23). They also gave a bibliography of the research 
projects completed in the laboratory. 


Measurement of Proficiency (Criteria) 


The Washington group early recognized the lack of systematic treatment 
of the criterion-to-be-predicted problem. They concerned themselves with 
efforts to establish methods of collecting and recording criterion data, with 
the investigations of factors influencing the reliability and validity of the 
criterion and with the development of technics of analysis. Jenkins (17) 
in a recent article summarizes the thinking of the group on the more 
pertinent aspects of the problem. 

The principal criterion used for the validation of the various selection 
devices was outcome-of-training (the award of the “wings” or the dismissal 
from training). Outcome-of-training was further refined into reason-for- 
failure, such as ground-training failures, psychologically unsuited, dropped 
at own request, etc. These criteria, naturally, were neither highly reliable 
nor valid for the prediction of combat pilot success (6, 14, 17). 

The first attempt to obtain combat criterion data was made by four 
naval psychologists who interviewed pilots with combat experience as they 
returned to the United States. Approaches considered and/or attempted 
were: (a) to determine what characteristics were important in meeting 
combat-requirements, (b) to obtain ratings or rankings of all members 
of an air group, (c) to use decorated versus undecorated pilots, and (d) to 
use number of planes shot down. It was finally decided to attempt to 
identify men regarded by fellow pilots as either definitely wanted or 
definitely not wanted as a member of their combat team. 

A member of the Aviation Psychology Branch was sent to the Pacific 
area to develop basic methods of obtaining combat criterion data. The 
“high” nominations were sought by asking the respondent to name two 
men of his acquaintance (living or dead, regardless of rank) on whom 
he would most like to fly wing in combat. Nominations for the “low” group 
were obtained by asking him to name two men whom he would not like 
to have flying wing on him in combat (47). 

Further nominations were collected by one psychologist at a west coast 
port and by four psychologists in the Pacific area. Over 800 respondent pilots 
were contacted, yielding approximately 1600 high and 1600 low nominations. 
The free-response data were later categorized for coding. The original 
thirty-three unit-categories were reduced to twenty-six. These twenty-six 
categories when sorted formed five category-clusters (33, 34, 42, 43). 

From these categories two checklists were constructed, one for the high 
or “wanted” pilots and one for the low or “unwanted” pilots. The checklist 
method yielded 2872 respondents with a total of 4325 nominated pilots. 
Of these nominees, 2267 were nominated as highs, 1832 were nominated 
as lows, and 226 pilots were nominated for both high and low by different 
respondents. The fact that so few pilots received conflicting nominations 
is taken as evidence of the validity of the nomination technic (35, 36, 37). 
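
The figures given for the checklist method can be checked by simple 
arithmetic. The short Python sketch below uses only the counts reported in 
the preceding paragraph; the variable names are supplied here for 
illustration. It confirms that the three groups of nominees are mutually 
exclusive and shows the share of nominees who received conflicting 
nominations. 

```python
# Counts as reported for the checklist method.
highs_only = 2267   # nominated as highs
lows_only = 1832    # nominated as lows
both = 226          # nominated high and low by different respondents

total_nominees = highs_only + lows_only + both
conflict_percent = 100.0 * both / total_nominees

print(total_nominees)                 # matches the reported 4325
print(round(conflict_percent, 1))     # about 5 percent of all nominees
```

A low rate of conflicting nominations indicates agreement among 
respondents; whether that agreement reflects combat proficiency itself is a 
separate question of validity. 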

In a report of the Combat Criterion Project to date, Carroll (1) reported 
on the preliminary work, the technic for coding free-response materials 
into categories, the use of sociometric diagrammatic technics, experimental 
design, and nature of the population. 

An incidental investigation was made of the relationship of frequency 
of response to importance of response. The results indicated considerably 
less than a perfect correlation (43). This may have implications for 
future research. 

Trumbull and Vinacke (30), concerned with the problem of a criterion 
for the validation of flight instructor selection tests, used student evalu- 
ations of their flight instructors to establish criterion groups for analysis 
of selection data. The agreement among six criteria of success was deter- 
mined, and the 20 percent of instructors rated best and 20 percent rated 
low were isolated. The six criteria showed agreement. Using a composite 
of these six, the extremes of flight instructors were defined. 

Another approach to the criterion problem was made in the validity 
study of five targets for testing visual acuity thru the correlation of the 
test results of each target with the Grow Chart scores. In addition, the test- 
retest reliabilities of all six tests were studied. Acuity scores, obtained in 
Snellen equivalents, were translated into log-units to facilitate statistical 
analysis. Additional systems were assayed for scoring each of the Randolph 
Field tests (26). 
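
Snellen fractions form a ratio scale, so that equal steps in acuity 
correspond to equal differences in logarithms rather than in the fractions 
themselves. The Python sketch below shows one plausible conversion, the 
common logarithm of the Snellen fraction. This is an assumption made for 
illustration; the report cited does not specify here which log-unit was 
adopted, and the function name is hypothetical. 

```python
from math import log10


def snellen_to_log(test_distance, letter_distance):
    """Common log of the Snellen fraction, e.g. 20/40 -> log10(0.5)."""
    if test_distance <= 0 or letter_distance <= 0:
        raise ValueError("distances must be positive")
    return log10(test_distance / letter_distance)


for letter_distance in (10, 20, 40, 80):
    print(f"20/{letter_distance}", round(snellen_to_log(20, letter_distance), 3))
```

On this scale each doubling of the letter distance lowers the score by the 
same amount, which makes means, variances, and correlations easier to 
interpret. 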

Estimates of the reliability of the Verhoeff test of depth perception were 
computed in a test-retest study (25). Four scoring methods were studied 
for their relative reliability and discrimination between levels of depth 
perception. 


Attitudes, Morale, and Leadership 


The Corpus Christi group became interested in basic emotional and 
social problems. One product was a discussion of the psychology of fear 
with emphasis on how to counteract it. A survey of attitudes and informa- 
tion regarding the war was made. The problem of leadership and organiza- 
tion in patrol plane crews was brought out for examination and treat- 
ment (8). 

A preliminary questionnaire study was made of the feasibility of using 
the nominating technic at preflight schools for evaluating leadership and 
associated qualities of aviation cadets and student aviation pilots. Two 
questions were asked: “What two men in your present platoon would you 
select as leaders for the new one?” “What two men in your present platoon 
would you least desire as leaders of the new platoon?” (31). 


Tabulating and Analysis Technics 


Much of the work of the naval aviation psychologists consisted of the 
development of technics of analysis. Unfortunately, little of this work has 
been put into written form. 


At Pensacola, Graybiel, Clark, and MacCorquodale (11) reported a 
method for observing and reporting the effect of angular acceleration and 
variations in “g” on visual perception during flight. The visual stimulus 
was a collimated “star” installed in the rear cockpit of a standard navy 
training plane. All observations were made in complete darkness. Both 
the pilot’s and observer’s verbal reports were dictated into an airborne 
wire recorder which also provided a time limit. These recordings were 
transcribed in the laboratory, and all analyses made from them. Fiske and 
Dunlap (9) presented a graphical test for the significance of differences 
between frequencies from different samples. 


CHAPTER III 


Psychological Research in the AAF Aviation 
Psychology Program 


FREDERICK B. DAVIS 


The reports of psychological research reviewed in this chapter were 
written by military and civilian personnel of the Army Air Force Aviation 
Psychology Program. These reports are all in published form; originally, 
the reviewer had hoped to include unpublished research reports (of which 
hundreds are on file), but several considerations made this inadvisable. 
In the first place, the nineteen AAF Aviation Psychology Program Research 
Reports which were listed by Flanagan (27) include most of the important 
research findings that were presented in the unpublished documents and 
that are not subject to restrictions for security purposes. In the second 
place, the task of reviewing the unpublished materials and assigning credit 
for the research reported in them proved to be prohibitive. 

In addition to the officially approved reports and articles reviewed, this 
chapter includes a few others that present results of research conducted in 
the AAF Aviation Psychology Program. 

Two articles that were written by personnel of the AAF Aviation Psy- 
chology Program about aviation psychology in enemy countries indicated 
clearly that in this field the air forces of the United States and its allies 
were far ahead of the German and Japanese air forces. Fitts (22), who 
served as official representative of the AAF Aviation Psychology Program 
on a mission to Germany for the purpose of studying the technics and 
procedures used by German Air Force psychologists during the war, re- 
ported that concepts of objectivity, standardization, reliability, and validity 
were almost completely disregarded by the German psychologists. So far 
as could be determined, no contributions to technic were made that would 
be of value to American psychologists. Geldard and Harris (35) visited 
Japan in November and December of 1945 to assess the work of psychol- 
ogists in the Japanese Air Forces. They found that both the Japanese Army 
and Navy Air Forces used batteries of paper-and-pencil tests and psy- 
chomotor tests to select men for pilot training. In general, it is interesting 
to note that, so far as aviation psychology is concerned, Japanese psychol- 
ogists seemed to be far more advanced than their German counterparts. 

It is not generally known how considerable were the contributions of 
the AAF Aviation Psychology Program to air-crew selection procedures 
employed by the Royal Air Force, the Royal Canadian Air Force, the 
Australian Air Force, and the South African Air Force. Special air-crew 
classification batteries were actually designed for use in the French, 
Chinese, and Philippine Air Forces. Lyerly (70) has discussed the prepa- 
ration and use of these batteries in some detail. 


Organization and Development of the 
AAF Aviation Psychology Program 


In Report No. 1 of the AAF Aviation Psychology Program Research 
Reports, Flanagan (25) described the development of the program, its 
main findings and accomplishments, and their implications for psychology 
and education. Following a brief introduction (25, Chapter 1), he pre- 
sented the historical background (25, Chapter 2) essential for an under- 
standing of the program and quoted official directives concerning it (25, 
Chapter 3). The objectives of the AAF Aviation Psychology Program were 
stated in 1943 and again in 1945 in articles in the Psychological Bulletin 
(81, 82). The organization and personnel of various research units were 
also discussed in these articles. Thorndike (101) has summed up the 
psychological research work in the AAF Aviation Psychology Program 
under two headings: first, the development and validation of tests for use 
in selecting and classifying air-crew personnel; and second, the solution of 
problems required to maximize the combat efficiency of personnel. 

Activities of psychologists in the AAF Training Command and some of 
the results of their work were described in an article prepared by the staff 
of the Psychological Section, Headquarters, AAF Training Command (89). 
DuBois (18, Chapter 2) outlined the location and functions of the psycho- 
logical units in the Training Command and Gilmer and Preston (37, 
Chapter 8) mentioned some of the administrative problems encountered 
in their operation. Simon and Berwick (96, Chapter 16) provided informa- 
tion concerning the special services performed during the war by the 
Statistical Unit of the Psychological Branch in the Headquarters of the 
Training Command. 

In addition to units in the continental United States, several detachments 
of psychologists were sent overseas for temporary duty. The histories and 
objectives of these detachments and of other missions undertaken abroad 
by members of the AAF Aviation Psychology Program were summarized 
by Lepley (66, Chapters 1, 2). In general, the detachments obtained combat 
validation data for test scores, made analyses of combat requirements, 
studied the aptitudes required of lead-crew personnel, and developed 
proficiency measures for air-crew specialties. 


Selection 


In the first published account of work in the AAF Aviation Psychology 
Program, Flanagan (28) reported the initial steps in developing a test for 
selecting air-crew members in the AAF. This test, first called the Aviation 
Cadet Qualifying Examination and later the AAF Qualifying Examination, 
was further described in subsequent publications (25, Chapter 4; 80), 
and particularly in a volume edited by Davis (14). The latter traced the 
development of the AAF Qualifying Examination over a period of four 
years (14, Chapter 1), described the research work underlying its develop- 
ment (14, Chapter 3), and made a general evaluation of its usefulness to 
the Army Air Forces (14, Chapter 12). The principles employed in con- 
structing this Qualifying Examination were set forth in some detail (14, 
Chapter 2) and should prove of interest to technicians confronted with 
the problem of assigning individuals to “accepted” or “rejected” groups 
without regard to individual differences within each group so obtained. 
The Qualifying Examination was constructed to serve a particular purpose, 
tho it found many uses (14, Chapter 4). 

In seven successive chapters of AAF Aviation Psychology Research Re- 
port No. 6, Davis (14) reported research on many kinds of test items tried 
out for use in the Qualifying Examination. Of three types of verbal items, 
reading-comprehension items were most useful for predicting graduation 
or elimination from pilot training in the AAF (14, Chapter 5). Factorial 
studies suggested that word knowledge and reasoning in reading are two 
important skills involved in reading. This result agreed with prewar studies 
by Davis. Successful efforts to develop tests of factual information that 
measure interests significantly related to graduation or elimination from 
pilot training in the AAF were described (14, Chapter 6). The technics 
employed should prove applicable to the construction of tests for educa- 
tional and vocational guidance. 

Objective test items that measure judgment and reasoning were found 
by Davis (14, Chapter 7) to be factorially complex. Several reasoning 
factors were identified, one of which was significantly related to gradua- 
tion or elimination from pilot training in the AAF. A mental skill believed 
to be peculiar to what is known as “judgment” was determined and named 
“evocation,” the ability to call relevant information to mind. 

The most useful items for predicting performance in pilot training were 
said by Davis (14, Chapter 8) to be mechanical-comprehension items. 
An investigation revealed that their variance could be accounted for almost 
entirely by four independent factors. The design of the factorial study was 
novel and should be of interest to students of factorial analysis. The use- 
fulness of twenty types of machine-scorable perceptual-test items for pre- 
dicting graduation or elimination from pilot training in the AAF was 
discussed by Davis (14, Chapter 9), and methods for ascertaining their 
efficiency in combination were outlined. Other types of items for which 
validity data were reported by Davis (14, Chapter 10) included mathe- 
matics items, interpretation-of-data items, and printed psychomotor items. 
The latter were especially recommended for additional research. 

The Victory Corps Aeronautics Aptitude Test which was widely dis- 
tributed by the U. S. Office of Education was constructed under Davis’s 
supervision and was described by him (14, Chapter 11). General pro- 
cedures for devising and refining aptitude-test forms were discussed by 
Thorndike (101, Chapter 3). 

The psychological research on the selection and training of bombardiers 
in the AAF that was accomplished prior to the establishment of Psycho- 
logical Research Unit (Bombardier) was summarized by Johnson (55, 
Chapter 2). Research on the selection of instructors for bombardier schools 
was reported by Larson (64, Chapter 7), including data that indicated 
substantial validity for the Instructor-Selection Stanine. Melton presented 
(73, Chapter 23) information concerning the validation of eight apparatus 
tests against criteria of performance in bombardier training. 

McClelland and Dailey discussed the correlations of twenty-two scores 
derived from the Air-Crew Classification Battery with five criteria of pro- 
ficiency in flight-engineer training (71, Chapter 5). Intercorrelations of 
a number of tests constructed especially for selecting flight engineers and 
tests in the Air-Crew Classification Battery were also reported. A com- 
plete report on the problem of selecting flight engineers was made in the 
volume edited by Dailey (13), including a summary of the research up 
to 1946 and suggestions for future work in the field (13, Chapter 6). 

Six phases of the problem of selecting gunners were described by 
Stolurow and Schrader (98, Chapter 6). Difficulties in obtaining satis- 
factory criterion variables and practical limitations that prevented the 
elimination of more than a small proportion of trainees were major handi- 
caps. Schrader, Pascal, and Valentine reported the development of a 
selection test for gunnery officers, which showed a significant positive cor- 
relation with performance in the Combat Gunnery Officers Course (93, 
Chapter 13). The selection and training of instructors in schools for 
flexible gunners were discussed by Stolurow, Irion, and Pascal (98, 
Chapter 12). After consideration of a number of possible criteria, gun- 
camera scores were chosen for use in experimental studies reported by 
Melton (73, Chapter 21) on the selection of flexible gunners. 

Tests used to select men for navigator training and the research data 
pertaining to them were described by Carter and Michael (6, Chapter 3). 
The instruments devised to predict performance as an instructor in navi- 
gation schools were considered by Zielonka, Rust, and Rosemark (114), 
together with data regarding their effectiveness in measuring specified 
criterion variables. 

A series of studies relating to the selection and evaluation of instructors 
in pilot-training courses in the AAF were reported by Galt and Grier 
(34, Chapter 14). Work on the prediction of performance in pilot training 
is reviewed in this chapter in connection with the Aviation Cadet Qualifying 
Examination and the Air-Crew Classification Battery. 

The history of research work on the selection of radar observers was 
written by Kunsman (61) and the validation of selection tests for radar- 
observer training courses was discussed by Kelley (57, Chapter 11). 
Multiple correlations (subject to shrinkage) of .36 to .50 with criteria 
consisting of course grades were obtained. Apparatus tests administered 
at Langley Field showed, according to Melton (73, Chapter 22), no signifi- 
cant correlations with any one of four criteria of success in radar training. 
Intercorrelations of the tests, obtained at Carlsbad Army Air Field, 
were low. 
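
The phrase "subject to shrinkage" above can be illustrated with a short sketch. It applies the familiar adjusted-R-squared correction, which discounts a multiple correlation for the number of predictors fitted relative to the sample size. The function name, the sample size, and the number of predictors are assumptions made for illustration; the source does not state which shrinkage estimate, if any, was applied to the radar-observer data.

```python
import math


def shrunken_multiple_r(r, n, k):
    """Estimate a shrunken multiple correlation.

    r: multiple correlation observed in the fitting sample
    n: number of cases in that sample (hypothetical here)
    k: number of predictor tests in the battery (hypothetical here)
    """
    if n - k - 1 <= 0:
        raise ValueError("sample must contain more cases than predictors plus one")
    # Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
    adjusted_r2 = 1.0 - (1.0 - r * r) * (n - 1) / (n - k - 1)
    # A negative estimate means no dependable relationship; report zero.
    return math.sqrt(max(adjusted_r2, 0.0))


if __name__ == "__main__":
    # Illustrative values only: R = .50 from 200 cases and 5 tests.
    print(round(shrunken_multiple_r(0.50, 200, 5), 3))
```

With a large sample and few tests the correction is slight; with a small sample and many tests an observed correlation near the lower end of the reported range can shrink to zero.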

Mollenkopf and Chaplin reported the design, construction, and use of 
tests for selecting instructors in the AAF Personnel Distribution Command
(77, Chapter 2). Several weighted composite scores (stanines) were
derived from these tests. Descriptions of motion-picture tests constructed 
for aptitude measurement by the Psychological Test Film Unit were 
presented by Lamkin, Schafer, and Gagne (63, Chapter 5). The tests gen- 
erally displayed low positive correlations with graduation or elimination 
from pilot training and contributed so little to the prediction of that 
criterion that the expense of using them for practical purposes could 
not be justified. 

The most rigorous study of the prediction efficiency of the procedures 
used in the AAF Aviation Psychology Program for selecting men for pilot 
training was designed by Flanagan (26). The study was unique and 
should prove invaluable to students of mental measurement. Thorndike 
reported the detailed results of the study (100, Chapter 5), which was 
based on the records of a large sample of applicants for pilot training 
who were admitted to training regardless of their scores on the Aviation 
Cadet Qualifying Examination and the Air-Crew Classification Battery. 
Case studies were made of sixteen men who obtained low scores on the 
selection tests and yet succeeded in completing pilot training and of
fifteen men who obtained high scores and failed in pilot training. Walton
presented two of these case studies as illustrations (105, Appendix C). 


Classification 


As explained by Flanagan (25, Chapter 4), after initial selection by
means of the Aviation Cadet Qualifying Examination, men accepted for 
air-crew training were classified for specialized training as pilots, bombar- 
diers, navigators, gunners, etc. Flanagan outlined the essentials of the 
classification problem and mentioned the efficiency in the utilization of 
personnel that can be secured by differential classification. In a volume 
edited by DuBois (18) the classification program in the AAF was explained 
in detail. DuBois (18, Chapter 1) recounted the history of and plans for 
the classification testing of aviation students; in collaboration with 
Preston, he described the composition of the air-crew classification batteries 
and certain statistical data derived from their use (18, Chapter 3). Exten- 
sive data concerning the validity of stanine scores derived from successive 
classification batteries were reported by DuBois, Preston, and Peltier (18, 
Chapter 4). A description of group testing in AAF classification centers 
and a discussion of the standardization of testing procedures were presented 
by Gilmer and Preston (37, Chapter 2; 37, Chapter 1). The authors like- 
wise described the personal interviews with aviation students and the 
criteria used in recommending them for types of air-crew training (37, 
Chapter 7). 
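
The stanine scores mentioned above are scores on a nine-point standard scale. The sketch below shows one conventional way of deriving them, assuming the usual 4-7-12-17-20-17-12-7-4 percentage distribution and mid-rank percentiles. The function names are hypothetical, and the procedure actually followed in AAF classification centers (the weighting of tests into composites in particular) is not described in this review and is not reproduced here.

```python
from bisect import bisect_left, bisect_right

# Cumulative percentage boundaries implied by the conventional
# 4-7-12-17-20-17-12-7-4 stanine distribution (an assumption).
CUMULATIVE_BOUNDS = [4, 11, 23, 40, 60, 77, 89, 96]


def stanine_from_percentile(percentile):
    """Map a percentile rank (0 to 100) to a stanine from 1 to 9."""
    return bisect_right(CUMULATIVE_BOUNDS, percentile) + 1


def stanines(scores):
    """Assign a stanine to each composite score using mid-rank percentiles."""
    n = len(scores)
    ordered = sorted(scores)
    result = []
    for score in scores:
        below = bisect_left(ordered, score)            # cases scoring lower
        equal = bisect_right(ordered, score) - below   # cases tied with this score
        percentile = 100.0 * (below + 0.5 * equal) / n
        result.append(stanine_from_percentile(percentile))
    return result


if __name__ == "__main__":
    # One hundred distinct hypothetical composite scores.
    assigned = stanines(list(range(100)))
    print([assigned.count(s) for s in range(1, 10)])
```

Because the scale is defined by rank, any composite that orders the men in the same way yields the same stanines.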

Articles concerning the personnel and organization of Psychological 
Research Units 1, 2, and 3 appeared in the Psychological Bulletin (86, 
87, 88). Research activities of the units were also described briefly. 

The development of tests for air-crew classification has been summarized
in volumes edited by Guilford and Lacey (43) and by Melton (73). Tho 
these tests were designed for use in the classification battery, the criterion 
for judging their value was their contribution to the prediction of per- 
formance in one or more air-crew specialties, as pointed out by Humphreys 
(52, Chapter 2) in a discussion of the program of printed-test develop- 
ment. It is reasonable to suppose that quite different judgments of value 
would have been made had the criterion for judging value been a test’s 
contribution to predicting only that part of an air-crew specialty not present 
in other specialties for which performance was to be predicted. Yet it is 
this type of differential prediction that is the crux of the classification 
problem. In practice, therefore, the Air-Crew Classification Battery served 
as a multiple selection test among men initially selected by means of the 
Aviation Cadet Qualifying Examination. 

Following an introduction to tests of intellect and information prepared 
by Humphreys (52, Chapter 4), Mock described tests of verbal ability 
(76, Chapter 5), Davis presented data concerning mechanical tests (16, 
Chapter 13) and mathematics tests (16, Chapter 6), and Fruchter reported 
the findings regarding a trait called judgment (31, Chapter 8) and the 
development of information tests (31, Chapter 14). Lacey and Tait 
reviewed research work on reasoning tests that were not incorporated in 
the Air-Crew Classification Battery (62, Chapter 7) and Zimmerman pre- 
sented data concerning tests of visualization and offered hypotheses regard- 
ing the mental traits measured by these tests (115, Chapter 12). The 
construction of measures of foresight and planning, and data pertaining 
to their factorial composition, were discussed by Guilford and Mock (43, 
Chapter 9). These authors also reported the development of tests of 
integration, the latter being defined as the ability to pay attention to 
several variables simultaneously and to respond to a combination of them 
(43, Chapter 10). Research on memory tests was reviewed by Lipman, 
Patterson, and Shirley (67, Chapter 11). Evidence of the existence of 
three independent factors thought to represent aspects of memory ability 
was adduced. 

The outline of plans for constructing perceptual tests was provided by 
Lacey (62, Chapter 15). Zimmerman discussed the development and 
factorial composition of perceptual speed tests (115, Chapter 16) while 
Lacey described the printed tests of form perception developed for possible 
use in the Air-Crew Classification Battery (62, Chapter 17). The nature 
of eleven tests of size and distance was considered by Lacey and Shirley 
(62, Chapter 18) and their value as tests of pilot aptitude was mentioned. 
Lacey and Niehaus (62, Chapter 20) reported efforts to measure the 
ability to determine one’s location relative to landmarks, while tests 
designed to measure other spatial abilities were discussed and their 
factorial content hypothesized by Howe and Zimmerman (51). Fruchter 
(31, Chapter 21) described experimentally developed measures of set and 
attention. 

The general approach to the problem of organizing and presenting
material for testing emotion, temperament, and personality was outlined 
by Guilford (43, Chapter 22). According to Cerf (8, Chapter 23), per- 
sonality inventories and questionnaires that were commercially available 
in 1942-1945 failed to yield scores significantly related to performance in 
pilot training in the AAF. Furthermore, Cerf concluded (8, Chapter 24) 
that predictions of such performance made by clinicians on the basis of 
sets of test scores and subjective judgment were of little or no value. A 
description of the biographical data blank adapted by the AAF from the 
form used by the Civil Aeronautics Administration and the Navy Bureau 
of Medicine and Surgery was provided by Mock (76, Chapter 27), who 
presented evidence of its value. Measures of specific traits of temperament 
that were developed or tried out in the AAF Aviation Psychology Program 
were discussed by Davis (16, Chapter 25). Grossman presented data (41, 
Chapter 26) concerning tests of motivation. 

One of the most interesting fields of investigation of the AAF Aviation 
Psychology Program was that of mass testing with apparatus tests. The 
history of the development of these tests was recounted by Melton (73, 
Chapter 1), who has discussed the problems arising in the course of the 
unprecedented use of apparatus tests and the technics devised to cope with 
these problems (73, Chapter 2). Melton has summarized (73, Chapter 25) 
the conclusions reached on the basis of over four years of intensive research. 
He has also discussed technical considerations, such as methods of deter- 
mining reliability coefficients for apparatus tests and of obtaining suitable 
criteria for validating them (73, Chapter 3). The mechanics of testing 
large numbers of aviation students with psychomotor apparatus were 
explained by Gilmer and Preston (37, Chapter 3). 

Among the standard classification-battery tests for which Melton has 
provided detailed specifications and elaborate data concerning their 
reliability and validity were the SAM Complex Coordination Test (73, 
Chapter 4), the SAM Two-Hand Coordination Test and the SAM Two-Hand 
Pursuit Test (73, Chapter 5), the SAM Discrimination Reaction Time Test 
(73, Chapter 6), the SAM Rotary Pursuit Test and the SAM Rotary Pursuit 
Test With Divided Attention (73, Chapter 7), the Rudder Control Test 
(73, Chapter 8), the Santa Ana Finger Dexterity Test (73, Chapter 9), 
six tests of steadiness designed to measure the effect of emotional stress (73, 
Chapter 10), and two Pedestal Sight Manipulation tests intended to 
select men for training as B-29 gunners (73, Chapter 11). It was found
experimentally that the psychomotor tests in combination made a significant
contribution to the prediction of such criteria as performance in pilot
training beyond that obtained from the use of paper-and-pencil tests alone. One of the
questions left unanswered by research completed during the war was 
whether paper-and-pencil tests could be developed to the point where the 
unique contribution of apparatus tests would be too small to warrant the 
expense of developing and administering them. 

In addition to the apparatus tests actually employed in the Air-Crew 
Classification Battery, Melton has listed many others that were still in the
experimental stage at the end of the war. He has presented as much data
about these as can be released under security restrictions. The tests included
six designed to measure compensatory visual-motor reactions (73,
Chapter 12), six that measure visual-motor pursuit skills (73, Chapter 13),
four path-tracing tests together with variations of three of these (73, Chap- 
ter 14), several coordination tests (73, Chapter 15), and nine visual 
discrimination-reaction tests (73, Chapter 16). Others were seven timing- 
reaction tests (73, Chapter 17), twelve manipulation and motility tests 
designed to aid in the selection of bombardiers and radar operators (73, 
Chapter 18), eight stress tests, one of which (the Falling Hammer) was 
validated against combat criteria by a detachment in England under the 
leadership of Lieutenant Colonel Paul Horst (73, Chapter 19), a large 
number of psychophysiological measures developed and studied extensively 
by M. A. Wenger (73, Chapter 19), and eight miscellaneous tests including 
measures of kinesthetic discrimination, foresight and planning, muscular 
coordination, sway compensation, and stability of orientation, as well as 
the AAF Physical Fitness Test, the SAM Control Sequence Memory Test, 
and the Minnesota Assembly Test (73, Chapter 20). 

Because of the need for placing air crews of the highest quality in lead 
planes, considerable research was undertaken to measure the abilities 
required of men in lead planes. This was summarized by Lepley (66, 
Chapter 9). 




Training 

Research on various aspects of training in the AAF was presented by 
Flanagan (25, Chapter 6). He discussed the content of training courses, 
the amount and rate of learning that took place, and the evaluation of 
training devices. The selection of instructors was also considered. Thorn- 
dike mentioned some of the problems of training experiments (101, 
Chapter 10). 

Most of the research work on training problems was undertaken by the 
AAF Aviation Psychological Research Projects at Training Command 
installations. An account of the history, organization, and research 
activities of the Psychological Research Project (Bombardier) was pre- 
sented briefly in the Psychological Bulletin (83) and in considerable 
detail in the volume edited by Kemp and Johnson (58). The latter wrote 
a brief background history of the training of student bombardiers and 
of instructors for bombardier training schools (55, Chapter 1), calling 
attention to the fact that over 47,000 bombardiers were trained in the 
AAF between the attack on Pearl Harbor in 1941 and the surrender of 
Japan in 1945. He also outlined the organization and mission of Psycho- 
logical Research Project (Bombardier) of which he was Assistant Director 
(55, Chapter 3). Kemp and Helmick reported an experimental study 
designed to show the improvement in circular error resulting from increasing
the number of bombs dropped during the bombardier training
course (58, Chapter 8). Johnson (55, Chapter 10) summarized the work
of the Psychological Research Project (Bombardier) and together with 
Kemp offered suggestions for future research in aviation psychology (58, 
Chapter 11). 

Psychological research concerning the selection and training of flight 
engineers began at Psychological Research Unit No. 2; later it was centered 
in the Psychological Research Project (Flight Engineer) at Hondo, Texas. 
The research projects undertaken and the trends that influenced their 
choice were outlined by French, McClelland, and Dailey (29). According 
to McClelland, Canfield, and Dailey, flight-engineer training was begun in 
April 1943 when excessive losses of bombardment aircraft on long over- 
water flights demonstrated the need for an air-crew member trained to 
operate engines at optimal power settings (71, Chapter 5). 

The psychological research carried out on flexible-gunnery training was 
reported in AAF Aviation Psychology Program Research Report No. 11, 
edited by Hobbs (50). He has written a brief history of the training of 
flexible gunners (50, Chapter 1), has pointed out the role of psychologists 
in the training program (50, Chapter 4), and has made a critical evaluation 
of the contributions of psychological research to gunnery training (50, 
Chapter 15). With Schrader (50, Chapter 11), he explained how psychol- 
ogists prepared curriculums, lesson plans, manuals, etc., for the training 
courses, formulated principles of program planning, and systematically 
evaluated the training programs. A description of the typical gunner in 
the AAF was written by Pascal (79, Chapter 3); the gunner was said to 
be about twenty-three years of age, a high-school graduate, and about 
half a standard deviation above average in mental ability. His motivation 
during training was not good. A description of several training devices 
used in flexible-gunnery training was given by Vallance and Schrader 
together with evaluative information pertaining to them (103, Chapter 9). 

The establishment of the Psychological Research Project (Navigator) 
in the AAF Training Command was described by Carter (6, Chapter 4) 
and a list of the personnel attached to it was presented (84). A complete 
account of the work of the project was made available in the research 
report edited by Carter (6), who also prepared a summary of psychological 
research in navigator training with suggestions for future planning and 
research (6, Chapter 13). Michael outlined the role of the navigator in 
the AAF, the selection of men for navigator training, and research in the 
problems of navigator training (74, Chapter 1). Suggestions regarding 
the length and arrangement of the content of the course in navigation 
resulted from a study of dead-reckoning navigation that was made by 
Dudek (19, Chapter 8). With Glaser, Dudek also reported the nature and 
results of a rigorous evaluation of a special training aid used in navigation 
training in the AAF, the so-called G-trainer. Important methodological
implications may be derived from this study.

In AAF Aviation Psychology Program Research Report No. 8, edited 
by Miller (75), psychological research was reported on objective measures 
of flying skill, printed tests of flying information, subjective measures of
flying proficiency, job analysis, and instructor selection and evaluation 
(75, Chapter 1). Prior to this, the functions, history, and personnel of 
the Psychological Research Project (Pilot) had been listed briefly and its 
research activities discussed at some length (85). Ericksen outlined the 
organization of the AAF Training Command and briefly described its 
functions (20, Chapter 2). Two controlled experiments in the training of 
pilots were summarized by Galt (34, Chapter 13). The results indicated 
that the use of twin-engine airplanes in basic pilot training improves per- 
formance on twin-engine airplanes in advanced training and that the 
use of optical sights on shotguns used for skeet training is desirable. The 
effect of adding five weeks of training to the normal courses in pilot 
training in the AAF was studied and the procedures and results were 
reported by Miller, Galt, and Gershenson (75, Chapter 10). A summary 
of the work of Psychological Research Project (Pilot) and recommenda- 
tions for further work in the field were provided by Miller (75, Chapter 15). 

Psychological research on radar-observer training in the AAF was pre- 
sented in a volume edited by Cook (10). The problems encountered in 
selecting and training radar observers were mentioned and the procedures 
employed to solve them were discussed. Cook compared the use of batteries 
of tests of relatively uncorrelated mental traits with the use of batteries of 
work-sample tests (10, Chapter 12). As would be expected when two 
batteries of tests measured essentially the same mental skills in two different 
combinations, both batteries turned out to provide approximately equal 
accuracy of prediction; in such a situation, the differences in intercorre- 
lations of the parts of the two batteries could have no appreciable effect 
on their accuracy of prediction of a single criterion. Hastorf (48, 
Chapter 1) defined the scope of AAF Aviation Psychology Program 
Research Report No. 12 and outlined the essential principles of radar, its 
adaptation to airborne use in combat operations, and the training program 
for radar observers in the AAF (48, Chapter 2). He also wrote (48, Chap- 
ter 3) a brief summary of research on the selection and training of radar 
observers accomplished under the auspices of the National Defense Re- 
search Committee and by Psychological Research Project (Radar). 

Studies of the acquisition and retention of air-crew skills were reviewed 
and some data pertaining to the retention of these skills during periods 
of inactivity were presented by Crawford, Sollenberger, Ward, Brown, 
and Ghiselli (12, Chapter 12). The instructional technics peculiar to the 
use of motion pictures, together with data regarding their effectiveness as 
teaching devices, were discussed by Gibson, Borin, Orvis, and Gagne (36, 
Chapter 10). 


Measurement of Proficiency and Criterion Studies 


Studies of the proficiency of bombardiers, flight engineers, flexible 
gunners, navigators, pilots, and radar observers were summarized by 
Flanagan (25, Chapter 5). Crawford, Sollenberger, Ward, Brown, and 
Ghiselli edited Report No. 16 in the series of AAF Aviation Psychology
Program Research Reports, which included data regarding the analysis 
of duties, the criteria of proficiency used for validation purposes, and the 
validity data for a number of air-crew positions (12). Their introduction to 
the report (12, Chapter 1) indicated its scope and purpose. That the most 
fundamental and, in many respects, the most difficult problem faced in the 
AAF Aviation Psychology Program was the definition and measurement 
of satisfactory criterion variables was pointed out by Thorndike (101, 
Chapter 4). Ultimate criteria were formulated but were rarely measurable.
Intermediate or even immediate criteria were therefore used and supple- 
mented with professional judgment. This is an excellent discussion of an 
important methodological issue. Efforts were made to maximize the rele- 
vance of available criteria and to minimize bias in them; of secondary 
consequence were efforts to maximize the reliability of criterion variables. 

Kemp discussed the development of phase checks to serve as criteria for 
validating tests used in the selection of bombardiers (58, Chapter 4). Pro- 
ficiency tests were constructed to provide measures of the practical knowl- 
edge about bombing and navigation required of bombardier students. 
These tests were described by Johnson (55, Chapter 5) and sample items 
were presented. An evaluation of various measures of proficiency for use 
in bombardier training was reported by Crawford, Sollenberger, Ward, 
Brown, and Ghiselli (12, Chapter 7). Johnson described (55, Chapter 6) 
surveys of the level of proficiency of aerial instructors and supervisory 
personnel at cadet bombardier schools and in the AAF Central Instructors 
School (Bombardier). Johnson (55, Chapter 9) also reported research on 
the development of a motion-picture test for target and check-point identi- 
fication, a study of the reliability of the circular error and of the percent 
of hits for C-1 autopilot bombing, and the results of several minor studies. 

Research work designed to improve existing criteria for judging the
performance of flight engineers and directed at the development of new
criteria was described by Seaman, Unger, Dailey, and McClelland (94).
The fact that Navigator Stanine scores have some promise for predicting 
performance in ground-school courses in operational training was indi- 
cated by Crawford, Sollenberger, Ward, Brown, and Ghiselli (12, Chap- 
ter 8) in studies of criteria for judging the proficiency of flight engineers. 

Stolurow stated that the measurement of proficiency among students at 
gunnery schools was at first ineffective (98, Chapter 7). Gradually, the 
situation was improved as well-constructed examinations became available 
and were uniformly administered and interpreted. Data concerning four 
forms of the Final Comprehensive Examination were presented. To meet 
the need for practical tests of proficiency in operating, caring for, and 
checking equipment, phase checks were developed, as described by Valen- 
tine (102, Chapter 8). A study made by Johnson and Milton (56, Chapter 
18) showed that a marked increase in accuracy of aiming a B-29 Pedestal 
Sight could be secured by redesigning the controls in the light of human 
capabilities and limitations. 


As part of the task of establishing procedures for selecting lead crews, 
research reported by Crawford, Sollenberger, Ward, Brown, and Ghiselli 
(12, Chapter 10) was undertaken to provide analyses of proficiency 
measures and synthetic-trainer scores for flexible gunners in operational 
training. These editors also reported research on evaluating the proficiency 
of air-crew members to provide criteria for selecting lead crews (12, 
Chapter 11). 

Research related to the development of aerial measures of navigation 
skill (97, Chapter 6) and objective ground measures of navigation skill
(97, Chapter 5) was described by Smith. Data resulting from studies of 
the graduation-elimination criterion and of the grades given in navigation 
schools were reported by Michael and Rosemark (74, Chapter 7). Analysis 
by Dudek, Peltier, Smith, Lyon, and King of the procedures used to 
determine position by means of dead-reckoning navigation indicated the 
relative importance of each of these procedures and provided leads for 
improving the teaching of dead-reckoning navigation and for decreasing 
“distance-off” (19, Chapter 9). The duties of the navigator in operational 
training and criteria for judging his proficiency were presented by Craw- 
ford, Sollenberger, Ward, Brown, and Ghiselli (12, Chapter 6). 

Problems involved in measuring pilot proficiency were discussed by 
Miller (75, Chapter 4) while Ben-Avi described the grades assigned to 
students during their flying training together with analyses and evalua- 
tions of them (2). Objective measures of flying skill that were developed 
for use in primary pilot training were presented by Youtz (113, Chapter 6);
objective measures of single-engine instrument-flying skill were discussed 
and evaluated by Hagin (47, Chapter 9) while Ericksen described the 
nature and use of objective measures of multi-engine instrument-flying
skill (20, Chapter 8). Four studies concerning the measurement of pilot 
skill in flying two-engine airplanes were also reported by Ericksen (20, 
Chapter 7). Fixed-gunnery scores as objective measures of flying skill 
were evaluated in research studies summarized by Gleason (38, Chap- 
ter 11). The development and the use of printed tests of flying information 
were considered by Robbins and Levine (91). A series of studies on
proficiency measures and their validation was reported by Crawford,
Sollenberger, Ward, Brown, and Ghiselli concerning the fighter pilot (12,
Chapter 2), the photo-reconnaissance pilot (12, Chapter 3), the co-pilot 
(12, Chapter 5), and the airplane commander (12, Chapter 4). Investiga- 
tions of fighter-pilot proficiency, the prediction of fighter-pilot combat 
proficiency, and fatigue factors in long-range fighter missions were sum- 
marized by Lepley (66, Chapter 10) from reports written by the investi- 
gator, Lieutenant Wilse B. Webb, an aviation psychologist attached to 
the 413th Fighter Group. Fitts reported the accuracy with which AAF pilots 
can reach objects placed around them when they are unable to see either 
the objects or their own bodies (23, Chapter 15). Accuracy is greatest 
reaching forward and below shoulder level. 

Graff, Kelley, and Hastorf discussed the development and content of five 
printed tests for measuring the proficiency of students in radar training
(39). The intercorrelations of several of these proficiency tests were 
presented by Kriedt, Johnston, and Kunsman (60), who pointed out the 
considerable amount of overlap indicated by the data. Six standardized 
performance tests, developed to supplement the measurement of radar-observer
proficiency by means of paper-and-pencil tests, were described
by Bray (4, Chapter 6). Sources of unreliability in the performance-test 
scores were discussed (4, Chapter 7) and two concepts of validity were 
mentioned. The use, reliability, and relationships of the circular error in 
radar bombing with other measures of proficiency were reported by Klein 
(59, Chapter 9). It was concluded that thirty to thirty-five hours of training 
are insufficient to develop a high degree of skill in radar bombing. No 
satisfactory criteria for validating measures of proficiency for radar 
observers in operational training were found, according to Crawford, 
Sollenberger, Ward, Brown, and Ghiselli (12, Chapter 9). 

Gibson reported the construction and use of motion-picture tests for 
measuring proficiency in aircraft recognition and target identification 
(36, Chapter 6). In collaboration with Gagne, he presented experimental 
data on several aspects of aircraft recognition (33). Evaluations of the 
Renshaw system and of some alternative training procedures were made. 
Davis described the construction of several specialized examinations used 
by the AAFP, including the Aviation Cadet Educational Examination, the 
Flight-Officer Examination, the AAF English Expression Test, and the
Victory Corps Aeronautics Aptitude Test (14, Chapter 11). Experiments 
reported by Melton (73, Chapter 24) revealed significant impairment of 
proficiency on both paper-and-pencil and apparatus tests at 15,000 to 
18,000 feet without oxygen and at 45,000 feet with oxygen. Performance 
on an addition test was found especially sensitive to changes in altitude. 

In spite of innumerable difficulties, more than 1872 different indexes 
of the combat validity of the selection and classification tests used in the 
AAF were obtained. In a research report edited by Lepley (66), these data 
were reported and discussed. Two faults of criterion data were said by 
Lepley (66, Chapter 3) to be low reliability and bias. The criteria used 
included objective measures, administrative actions, direct and systematic 
observations, and ratings based on general impressions. Lepley described 
the use of proficiency tests for assembling lead crews and for detecting the 
need for precombat or refresher training (66, Chapter 8). The tests found 
to be most predictive for bombardier criteria of combat effectiveness were 
Spatial Orientation I and II, Mathematics A and B, Mechanical Principles, 
Discrimination Reaction Time, and the Pilot Stanine (66, Chapter 4). Of 
thirty-seven variables correlated with several measures of navigator effec- 
tiveness in combat, Lepley reported (66, Chapter 5) that sixteen had pre- 
dominantly positive correlations. The four best predictors were Technical 
Vocabulary (Navigator), Technical Vocabulary (Pilot), Arithmetic Rea- 
soning, and Mathematics B. The absolute magnitudes of the validity 
coefficients were not especially meaningful because of marked attenuation 
resulting from the rigorous selection of navigators at classification centers
on the basis of the Navigator Stanine. A total of 889 validation statistics, 
using the combat effectiveness of pilots as the criterion, were presented by 
Lepley (66, Chapter 6). Greatest effectiveness was found for predicting 
fighter-pilot performance. The most useful classification tests were Mechan- 
ical Principles, SAM Two-Hand Coordination, SAM Rotary Pursuit, Spatial 
Orientation I and II, Aiming Stress (portrayed on the stage in Winged 
Victory), and Table Reading. Criteria of success in combat different from 
those employed in the studies summarized by Lepley were utilized by 
Mollenkopf. He found no evidence of significant relationships between the 
criteria he used and various selection tests (77, Chapter 4). Lepley summarized
the psychological research work done in various combat areas
by sixteen officers and three enlisted men on temporary duty (66,
Chapter 11).
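
The attenuation of validity coefficients through prior selection, noted above for the navigator data, can be sketched numerically. The example uses the standard correction for direct restriction of range on the predictor itself. That is a simplifying assumption: the navigators were selected on the Navigator Stanine, a composite, rather than on each test separately, so the sketch only indicates the direction and rough size of the effect. The function name and all numerical values are hypothetical and are not taken from Lepley's report.

```python
import math


def correct_for_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Estimate the validity coefficient in the unselected group.

    r_restricted:    correlation observed in the selected group
    sd_unrestricted: predictor standard deviation among all applicants
    sd_restricted:   predictor standard deviation among those selected
    Assumes selection was made directly on this predictor.
    """
    u = sd_unrestricted / sd_restricted
    r2 = r_restricted * r_restricted
    # Correction for direct selection on the predictor:
    # R = r*u / sqrt(1 - r^2 + r^2 * u^2)
    return r_restricted * u / math.sqrt(1.0 - r2 + r2 * u * u)


if __name__ == "__main__":
    # Illustrative values only: a coefficient of .20 in a group whose
    # spread on the predictor is half that of the applicant population.
    print(round(correct_for_range_restriction(0.20, 2.0, 1.0), 3))
```

When the selected group is as variable as the applicant population, the function returns the observed coefficient unchanged; the more severe the selection, the larger the estimated unrestricted value.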


Studies of Requirements 


Studies of the requirements of air-crew positions were made at various 
times in the AAF Aviation Psychology Program and for many different 
purposes. Job requirements for the bombardier, navigator, and pilot were 
reported by Walton (106). Thorndike summarized (101, Chapter 2) the 
job-analysis procedures employed as a basis for test construction: (a) 
review of existing literature, (b) analysis of records of performance, 
(c) interviews with air-crew personnel, (d) direct experience on the part 
of psychologists, and (e) correlation of tests and criteria. 

A description of tasks performed by students in flight-engineer training 
schools was presented by Schmonsees, Unger, Riecken, and McClelland 
(92), who also made a job analysis in terms of psychological traits. 
According to Valentine (102, Chapter 2), the task of the flexible gunner 
was ordinarily that of firing at a target (an attacking fighter plane) from 
a platform (a bomber) also moving in three dimensions. A discussion of 
the skills and abilities involved in gunnery was prepared by Irion (53, 
Chapter 5), who also considered the use of synthetic trainers as criterion 
measures. 

A job analysis of the navigator’s task and the attributes of a successful 
navigator were presented by Whiteside and Glaser (109). Youtz has pro- 
vided a convenient summary of the skills and abilities required of a pilot 
(113, Chapter 3, Part I) and Ericksen made an analysis of the pilot’s task 
in specialized types of activities, such as instrument flying, night flying, 
navigation, and formation flying (20, Chapter 3, Section II). 

Kelley reported a job analysis for radar observers (57, Chapter 4) made 
largely in terms of mental abilities defined by centroid factors to which 
names were ascribed on the basis of subjective judgment. Investigations of 
the combat requirements for air-crew personnel were summarized by 
Lepley (66, Chapter 7), and Flanagan discussed research on mission 
failures and on errors of personnel during operations in combat (25, 
Chapter 7). 


Attitudes, Morale, and Leadership 


Research on attitudes, morale, and leadership was conducted by the 
AAF Aviation Psychology Program largely in the AAF Personnel Distribu- 
tion Command in redistribution stations or convalescent hospitals. Many 
of the studies made in these installations were summarized by Flanagan 
(25, Chapter 8), among them investigations of fear and courage in aerial 
combat, anxiety reactions, counseling and therapy, and attitudes and 
preferences of combat returnees. 

Psychological research on problems of redistribution was summarized 
in a report edited by Wickert (110). In this report, Wickert recounted 
the history of psychological research in AAF redistribution stations and 
listed the personnel engaged in it (110, Chapter 1); he also made an 
over-all evaluation of the work and mentioned the potential value of data 
that were gathered but not fully analyzed during the war (110, Chapter 8). 
Crannell and Mollenkopf outlined the extensive research conducted to 
determine the essentials of combat leadership (11). Methodological prob- 
lems of research in leadership were stressed and the instruments used 
were described. Studies conducted to ascertain the nature of anxiety re- 
actions in combat were reported by Shaffer (95, Chapter 5). The tests 
used to select air-crew personnel were found to be unrelated to the pres- 
ence or absence of anxiety reaction as determined by psychiatric examina- 
tion. A Personality Inventory was developed, however, which consistently 
showed biserial correlations of the order of .50 with the criterion. With 
Kamman, Lecznar, Pearson, and Williams, Shaffer discussed surveys of 
fear and courage in aerial combat, of the psychological causes of mission 
failures, and of disorientation in instrument flying (95, Chapter 6). The 
attitudes and preferences of AAF air-crew personnel returned to the con- 
tinental United States from combat zones were described by Shaffer and 
Pearson (95, Chapter 7). Differences among fighter pilots, bomber pilots, 
bombardiers, and navigators were pointed out. 

The attitudes and opinions of flexible gunners who had recently returned 
from combat, had graduated from a training school, or had not entered 
into combat were reported by Irion (53, Chapter 10). Some of the prob- 
lems encountered in training navigators who had been returned from 
combat and assigned to the AAF Instructors School for Navigators were 
outlined by Friedman, Rosemark, Heathers, Grigg, and Zielonka (30). 
A study of the attitudes of air-crew personnel (both officer and enlisted) 
returned from combat toward further duty assignments of various types 
was described by Crawford, Sollenberger, Ward, Brown, and Ghiselli 
(12, Chapter 13). 

Bijou (3) edited a volume of the AAF Aviation Psychology Program 
Research Reports concerning research in AAF convalescent hospitals. The 
need for and development of the psychological services and research work 
in these hospitals was stated by Bijou and Gillman (3, Chapter 1). The 
psychological services were described in detail by McNeill, Heathers, Rotter, 
Willerman, and Lawrence (72), including evaluation procedures and 
individual and group counseling technics. Research activities were outlined 
by Bijou and Heathers (3). The tests and inventories that were used 
were mentioned and the criteria used to validate them were listed. These 
same authors also prepared a summary and evaluation of all the service 
activities and research work of psychologists in the AAF convalescent 
hospitals (3, Chapter 11). 

Data derived from the administration of five personality inventories 
were summarized by Heathers (49), who concluded that all five of them 
possessed substantial utility. Lawrence and Levine investigated attitudes 
of patients in AAF convalescent hospitals (65, Chapter 5) and Lawrence 
reported data suggesting that biographical information may be useful in 
making prognoses for convalescent patients (65, Chapter 6). Descriptions 
of a number of interest questionnaires used in AAF convalescent hospitals 
and results obtained from their use were presented by Lucio and Mc- 
Reynolds (69). 

In an effort to measure the impairment of mental efficiency associated 
with psychiatric disorders, the Shipley-Hartford Retreat Scale for measur- 
ing mental impairment and a new Efficiency of Mental Application Test 
were tried out in the convalescent hospitals. Bijou and Lucio discussed 
the findings, noting that both tests showed promise, particularly the test 
assembled especially for use in the AAF (3, Chapter 8). Three projective 
tests were used in the convalescent hospitals by psychologists in the AAF 
Aviation Psychology Program: the Rorschach Test, the Bender Visual- 
Motor Gestalt Test, and the Incomplete Sentences Test. Of the three, the 
last seemed to differentiate best between normal and maladjusted patients. 
These data were reported by Wischner, Rotter, and Gillman (112). An 
ingenious method of quantifying interpersonal behavior in group counsel- 
ing was described by Willerman and Pascal (111) and an illustration of 
its use was given. 

In an article published in the Psychological Bulletin (79), Super dis- 
cussed case studies and clinical evaluations of aviation cadets together 
with the projective technics employed. Tho most of the work of the AAF 
Aviation Psychology Program was concerned with mass testing by means 
of objective measures, elaborate studies of clinical procedures were made 
on samples of aviation students in order to assess their efficacy. In gen- 
eral, the data showed that clinical evaluations did not add anything to 
predictions of performance made solely on the basis of machine-scorable 
objective tests. 


Tabulating and Analysis Technics 


Elaborate safeguards were employed in test-scoring operations of the 
AAF Aviation Psychology Program to prevent and catch errors. The 
procedures used in scoring classification tests were described by Gilmer 
and Preston (37, Chapter 4). These authors likewise discussed the routine 
checks and statistical technics employed to insure comparability in the 
classification-test scores derived from apparatus tests (37, Chapter 5). 

Because validation of test scores obtained at classification centers and 
psychological examining units against performance in training courses 
and in certain air-crew duties was the foundation stone of research carried 
on in the AAF Aviation Psychology Program, it was essential to have an 
accurate, complete, and convenient records system. Gilmer and Preston 
described the routine of handling records in psychological examining units 
(37, Chapter 6) while Simon and Berwick described the records system 
at the Headquarters of the AAF Training Command (96, Chapter 10), 
where test scores from many examining units were filed together. The basic 
records files (96, Chapter 11) and the training-data files (96, Chapter 12) 
maintained in the Psychological Section at Headquarters, AAF Training 
Command were also discussed by Simon and Berwick (96, Chapter 13). 
These authors mentioned everyday problems encountered in the collection 
and maintenance of machine records and made suggestions for avoiding 
them (96, Chapter 17). They discussed the types of errors common in 
machine-records operations and methods used to control them (96, Chap- 
ter 18). General considerations in the establishment and use of machine- 
records systems, with illustrations from their experience in the AAF 
Aviation Psychology Program, were presented by Simon and Berwick 
(96, Chapter 9). They also discussed the dissemination of data by means 
of rosters, punched cards, and microfilms (96, Chapter 14). 

In AAF Aviation Psychology Program Research Report No. 3 edited 
by Thorndike (101), some of the technical problems encountered in 
psychological research work during the war were considered and the pro- 
cedures developed to meet them were summarized. Special attention was 
given to problems associated with the selection and classification of personnel. 
To express validity coefficients, the product-moment r was used 
whenever possible. With dichotomized criteria, biserial rather than point- 
biserial r’s were computed in order to minimize the effect of variation 
in the position of the dichotomic line on validity coefficients obtained in 
different samples. Thorndike discussed these and other correlation statistics 
used in determining the validity of single tests (101, Chapter 5) and pre- 
sented the formulas used to correct for restriction of range due to prior 
selection. Procedures for obtaining composite aptitude scores were outlined 
by Thorndike (101, Chapter 6). The multiple-regression and multiple- 
cutoff methods were contrasted and the reasons for choosing the former 
for use in the AAF Aviation Psychology Program were mentioned. A for- 
mulation of the problem of a unique classification system was presented. 
Emphasis was given (101, Chapter 8) to the significance of the intercor- 
relations of a set of variables proposed for use for selection purposes. 
Three types of prediction problems were identified: selection, multiple 
selection, and classification. The importance of test reliability as an aid 
to interpreting test-validity data was stressed by Thorndike (101, Chap- 
ter 7) and various ways of computing reliability coefficients were men- 
tioned. An analysis of the sources of variance in test scores was especially 
noteworthy (p. 102-103). Several formulas developed by A. P. Horst to 
determine the loss of test validity ascribable to extraneous variance in 
test scores were presented by Thorndike (101, Chapter 9). Methods used 
to minimize extraneous variance in test scores obtained in the AAF 
Aviation Psychology Program, especially in apparatus-test scores, were 
found to be highly effective. 
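
The distinction drawn above between biserial and point-biserial coefficients, and the correction for restriction of range, may be made concrete with a brief sketch. The Python below is illustrative only. The function names are hypothetical, and the formulas are the familiar textbook forms, not reproductions of the formulas printed in Thorndike's report. The range correction assumes explicit selection on the predictor with linear, homoscedastic regression.

```python
from math import sqrt
from statistics import NormalDist, fmean, pstdev


def point_biserial(scores, passed):
    """Product-moment r between a continuous score and a 0/1 criterion."""
    high = [s for s, g in zip(scores, passed) if g]
    low = [s for s, g in zip(scores, passed) if not g]
    p = len(high) / len(scores)          # proportion in the "pass" group
    q = 1.0 - p
    return (fmean(high) - fmean(low)) / pstdev(scores) * sqrt(p * q)


def biserial(scores, passed):
    """Biserial r: assumes a normally distributed variable underlies the
    dichotomy, which makes it less dependent on where the cut falls."""
    p = sum(1 for g in passed if g) / len(passed)
    q = 1.0 - p
    std_normal = NormalDist()
    # Ordinate of the unit normal curve at the point of dichotomy.
    ordinate = std_normal.pdf(std_normal.inv_cdf(q))
    return point_biserial(scores, passed) * sqrt(p * q) / ordinate


def correct_for_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Estimate r in the unselected group from r in the selected group
    (assumed case: explicit selection on the predictor itself)."""
    k = sd_unrestricted / sd_restricted
    return r_restricted * k / sqrt(1.0 - r_restricted**2 + (r_restricted * k) ** 2)
```

Because the biserial coefficient divides by the normal ordinate at the cut, it can exceed the point-biserial considerably, and with markedly non-normal scores it may even exceed unity; it should therefore be read with the normality assumption in mind.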

Most research workers will want to become familiar with parts of AAF 
Aviation Psychology Program Research Report No. 18, edited by Deemer 
(17). In this report Alchian has written four chapters on the methods of 
statistical analysis employed in the AAF Aviation Psychology Program 
that are notable for their presentation of up-to-date concepts in surprisingly 
compact and straightforward fashion. The basic principles of modern 
statistical analysis and inference were stated succinctly (1, Chapter 20) and 
were followed by detailed descriptions of the procedures used to estimate 
the parameters of univariate distributions (1, Chapter 21). The statistics 
employed in bivariate analyses were set forth with the tests of significance 
appropriate for use with them (1, Chapter 22) and technics of multi- 
variate analysis were described with special reference to regression sta- 
tistics (1, Chapter 23). 

This research report (No. 18) also includes two interesting chapters 
written by Simon and Berwick on machine technics. In one of these, 
detailed procedures for obtaining biserial correlation coefficients and 
intercorrelations were presented (96, Chapter 15), and in the other a 
method for obtaining the sums of squares and of products with the IBM 
alphabetical accounting machine was described (94). 
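
The reason sums of squares and of products were the quantities wanted from the accounting machine is that a product-moment correlation can be computed from six running totals alone. The sketch below is an assumed modern illustration of that arithmetic in Python, not a description of the machine procedure in the report; the function names are hypothetical.

```python
from math import sqrt


def accumulate(pairs):
    """Return N, sum X, sum Y, sum X^2, sum Y^2, sum XY in a single pass."""
    n = sx = sy = sxx = syy = sxy = 0
    for x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    return n, sx, sy, sxx, syy, sxy


def r_from_sums(n, sx, sy, sxx, syy, sxy):
    """Product-moment r computed from the six totals only."""
    numerator = n * sxy - sx * sy
    denominator = sqrt((n * sxx - sx**2) * (n * syy - sy**2))
    return numerator / denominator


# Usage: three paired scores.
totals = accumulate([(1, 1), (2, 3), (3, 2)])
print(r_from_sums(*totals))  # 0.5
```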

Statistical procedures commonly used in one or two psychological 
research units for computing reliability coefficients, item-analysis data, 
validity data, and factorial data regarding items and tests were outlined 
by Humphreys (52, Chapter 3). The type of internal-consistency and 
external-criterion item-analysis data used in the development of the 
Aviation Cadet Qualifying Examination and many other examinations was 
explained by Davis (14, Appendix A). Detailed instructions for computing 
the data as well as evidence of its reliability were provided. Item difficulty 
indexes were found to be more reliably determined than item-test correlation 
coefficients. Guilford discussed the factorial composition of a large number 
of the tests developed for use in classifying aviation students and related 
these data to the criteria that were to be predicted (43, Chapter 28). 
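
The two kinds of item-analysis data mentioned above, difficulty indexes and item-test correlations, can be sketched as follows. This is a minimal Python illustration under stated assumptions: difficulty is taken as the proportion passing and the item-test correlation as a point-biserial against total score. The specific indexes described by Davis may differ, and the function name is hypothetical.

```python
from math import sqrt
from statistics import fmean, pstdev


def item_analysis(responses):
    """responses: one list of 0/1 item scores per examinee.
    Returns a list of (difficulty, item_test_r) pairs, one per item."""
    totals = [sum(row) for row in responses]
    sd_total = pstdev(totals)
    results = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        p = fmean(item)                      # proportion passing the item
        if 0 < p < 1 and sd_total > 0:
            mean_pass = fmean([t for t, x in zip(totals, item) if x])
            mean_fail = fmean([t for t, x in zip(totals, item) if not x])
            r = (mean_pass - mean_fail) / sd_total * sqrt(p * (1 - p))
        else:
            r = float("nan")                 # undefined when nobody or everybody passes
        results.append((p, r))
    return results
```

Note that the total score here includes the item itself, which inflates the correlation somewhat for short tests.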


Design of Equipment 


The establishment of a Psychology Branch in the Aeromedical Labora- 
tory at Wright Field, as reported by Fitts (24), provided a central point 
for psychological research on the design of equipment. Previously, con- 
siderable work in this area had been accomplished in the AAF Aviation 
Psychology Program, but no organization had been specifically charged 
with the responsibility. A report edited by Fitts (23) presented the research 
data accumulated on the design of equipment with regard to human capa- 
bilities and limitations. The nature of engineering psychology and its 
applications, methods, and technics were discussed by Fitts (23, Chapter 1). 
Problems associated with the means of presenting information obtained 
in the form of instrument readings were described by Grether (40, Chap- 
ter 2). Brown and Jenkins prepared an outline of research related to the 
design of equipment, which was based on an analysis of human motor 
abilities (5). A bibliography was appended. 

A number of studies have been made to determine how aircraft instru- 
ments and accessory materials should be designed to minimize errors in 
using them. Comparing the relative ease and accuracy with which tables 
and graphs were read, Carter concluded that tables are preferable as a 
means of presenting data if interpolation is not required. If it is, graphs 
are to be preferred (7, Chapter 4). The sources of error in reading air 
navigation plotters were identified and, according to Christensen (9), a 
new plotter has been designed that should prove considerably easier to 
use. Grether has shown that a twenty-four-hour dial face on a clock is 
easier to read than a twelve-hour dial face provided that time is to be 
read according to the twenty-four-hour system (40, Chapter 6). Optimum 
characteristics of a twenty-four-hour clock face were determined. Some 
findings with respect to dial faces are interesting; Grether and Williams 
discovered (40, Chapter 7) that the accuracy with which dials were read 
increased as their diameters were increased up to two inches. It also 
increased as gradations were increased to seven-tenths of an inch. On the 
other hand, speed of dial reading did not appear to be related to size of 
dial diameter or scale interval. A study of the interpretability of various 
types of aircraft attitude indicators, made by Loucks (68), showed that 
for blind flying the horizon should remain fixed and a three-dimensional 
miniature aircraft should constitute the moving element and should move 
in the direction in which the plane rolls. 

Another group of related studies pertained to airplane control knobs 
and their uses. Weitz (108) concluded that coding control knobs by color 
and shape helped reduce the difficulty normally experienced when a pilot 
shifts to an unfamiliar airplane in which the controls are placed differently. 
Experimentation with control knobs of various shapes has indicated, ac- 
cording to Jenkins (54, Chapter 14), that knobs of certain shapes are less 
frequently confused than others and should be standardized for use in 
aircraft cockpits. Data obtained by Grether (40, Chapter 17) showed that 
airplane controls can be handled more efficiently with the arms and hands 
than with the legs and feet. Fore-and-aft movements were found more 
efficient than lateral movements. If a group of controls must be adjusted 
rapidly in a certain sequence, Murray recommended (78) that they be 
operated in a similar direction. Clockwise movement of a rotary control 
of an indicator should, Carter and Murray found (7, Chapter 10), be 
associated with downward and left-to-right movement of the indicator. 
In a different experiment, Warrick discovered that clockwise rotation of a 
control knob should be associated with movement of an indicator toward 
the operator and from right to left (107, Chapter 9). 

Mild anoxia (the condition resulting from lack of sufficient oxygen) 
seemed not to affect the number of illusions under experimental condi- 
tions reported by Grether, Cowles, and Jones (40, Chapter 19). The errors 
made by a pilot in reading instrument dials tend to increase in the 
presence of moderate G force, as indicated by Warrick, Nelson, and Lund 
(107, Chapter 20). 

On the basis of an investigation of ability to reproduce pressures, Jenkins 
concluded (54, Chapter 12) that a wide range of pressures from five pounds 
up to thirty or forty pounds should be required in the operation of air- 
plane controls. Pressures greater or less than those limits seem to be more 
difficult to reproduce accurately. According to Van Saun (104), for radar 
operators the polar-grid sector scope was superior to the cartesian-grid 
sector scope. Both scopes were more readily interpreted when the PPI 
scope and the sector scopes had the same orientation. 

Important contributions were made by psychologists to the design of 
equipment for flexible gunners and to the technics used in sighting and 
aiming. Some of these have been mentioned previously in this chapter; 
others were discussed by Vallance (103, Chapter 14). 


Motion-Picture Testing and Research 


A complete report of the work on motion-picture testing and research 
in the AAF Aviation Psychology Program was provided in the volume 
edited by Gibson (36). Some special aspects of the work have been men- 
tioned previously in this chapter in connection with topics to which they are 
relevant; other aspects of the work will be summarized in this section. 

The history, functions, and personnel of the Psychological Test Film 
Unit were first published in the Psychological Bulletin together with the 
hypotheses to be tested, research work under way, and test-construction 
technics employed (90). Gibson wrote more fully on these topics (36, 
Chapter 1). The peculiar characteristics of motion-picture tests and the 
unique possibilities of their application to psychological testing were dis- 
cussed by Gagne, Bornemeier, Gibson, and Borin (32). Many practical 
problems in constructing and producing motion-picture tests were reported 
by Gibson, Bornemeier, Eisenberg, and Slater (36, Chapter 3). Some of 
the problems confronted in the presentation of motion-picture tests were 
mentioned by Finney and Gibson (21). Experimental evidence of the 
effects of varied amounts of illumination and of seating position on the 
perception of motion pictures was obtained. More theoretical were Gibson’s 
discussion (36, Chapter 8) of the differences between the perception of 
pictures and the perception of visual realities and Gibson and Glaser’s 
formulation (36, Chapter 9) of a systematic theory to account for observed 
data regarding individual ability in monocular space perception. Further 
research in the perception of space is needed, according to the authors. 


Implications of Psychological Research in the 
AAF Aviation Psychology Program 


It has been impossible to include in this chapter all of the articles con- 
cerning the applications to psychological and educational research of work 
done in the AAF Aviation Psychology Program, but a great many refer- 
ences have been reviewed. 

Flanagan discussed general contributions of the AAF Aviation Psy- 
chology Program to the theory and knowledge of individual differences 
and trait differences (25, Chapter 9). Considerations pertaining to the 
trait theory of human abilities, the measurement of traits, the significance 
of motivation, and the nature and significance of personality factors were 
taken up. Of special interest to educators was Flanagan’s statement of the 
implications of research in aviation psychology regarding the nature and 
principles of learning, the relative importance of aptitude and training, 
and the measurement of success (25, Chapter 10). The procedures utilized 
in the AAF Aviation Psychology Program for the measurement of achieve- 
ment and the prediction of human behavior will be of interest to research 
workers in psychology and education. Flanagan has discussed these along 
with the statistical technics and experimental methods employed (25, 
Chapter 12). He has also commented on several types of research studies 
leading to the design of equipment for maximum efficiency (25, Chapter 
11). Altho to many research workers, much of the experimental work on 
the design of equipment may appear to be elaborate (and, therefore, ex- 
pensive) demonstrations of the obvious, Flanagan believes that much work 
will be carried on in this field in the future. It would appear that this is 
likely, since its application in industrial psychology is clear. 

Guilford has published considerable material concerning the general con- 
clusions and implications drawn from testing and classifying aviation 
cadets in the AAF (43, Chapter 29). The discovery of aptitude and 
achievement variables was reported (42) and Guilford and Zimmerman 
(46) listed twenty-seven factors found by centroid analyses of a number 
of different correlation matrices based on scores from tests administered to 
highly selected men in aviation-cadet training. The factors have been iden- 
tified subjectively by the authors and their co-workers in the AAF Aviation 
Psychology Program. In the case of ten of these factors the authors believe 
the names chosen for them may be reasonably accurate descriptions. The 
practical value of well-established psychological principles was demon- 
strated during the war, in the opinion of Guilford (43), who drew some 
lessons from aviation psychology. In another publication (44), Guilford 
mentioned findings that confirmed long-established principles of test theory; 
namely, that test validity coefficients are more important than test reliability 
coefficients for evaluating tests and that the value of a test for multiple 
selection should be judged in terms of its unique contribution to accuracy 
of prediction rather than in terms of its validity coefficient. There will be 
general agreement on these points, but whether factorial analysis provides 
the means of reaching the desired objectives may be a matter for 
further discussion among test technicians. 

Davis prepared for a commission of the American Council on Educa- 
tion a brief description of the selection and classification procedures used 
in the armed forces together with their implications for civilian education 
(15). He indicated that the technics for selecting and classifying aviation 
cadets in the AAF Aviation Psychology Program constituted the first 
practical demonstration of the principles that are likely to form the basis 
for soundly conceived instruments useful in educational and vocational 
guidance in the future. For differential selection and classification of 
personnel it appears likely that tests will be developed to measure the 
variance common to the several criteria to be predicted and to measure 
separately the variance that is unique to each one of the criteria. The 
relative weighting of the tests measuring common and unique variance 
will depend on the proportion of the available manpower that can be 
rejected entirely. To secure measures of unique variance in each criterion 
it is not enough to make use of tests that are merely independent; in prac- 
tice, such tests will probably be constructed by correlating individual 
test items (that measure as nearly as possible only one mental function 
and that are maximally reliable) with each one of the criteria to be 
predicted and building up groups of items that have correlations as high 
as possible with one criterion and as low as possible with all other 
criteria. This is the logical extension to test construction for purposes of 
differential classification of the principles employed to construct the 
Aviation Cadet Qualifying Examination for purposes of selection alone. 
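
The item-selection principle just described can be sketched in a few lines. The Python below is a hypothetical illustration only: the function name, the data layout, and the two cutoff values are assumptions introduced here, not figures or procedures taken from Davis.

```python
def select_differential_items(item_criterion_r, target,
                              min_target_r=0.30, max_other_r=0.10):
    """Pick items that correlate highly with one criterion and only
    negligibly with every other criterion.

    item_criterion_r maps item name -> {criterion name: correlation}.
    Both cutoffs are illustrative assumptions."""
    selected = []
    for item, r_by_criterion in item_criterion_r.items():
        r_target = r_by_criterion[target]
        others = [abs(r) for name, r in r_by_criterion.items() if name != target]
        if r_target >= min_target_r and all(r <= max_other_r for r in others):
            selected.append(item)
    return selected


# Usage with invented correlations for two criteria.
table = {
    "item_1": {"pilot": 0.40, "navigator": 0.05},
    "item_2": {"pilot": 0.35, "navigator": 0.30},  # valid, but not differential
    "item_3": {"pilot": 0.02, "navigator": 0.45},
}
print(select_differential_items(table, "pilot"))      # ['item_1']
print(select_differential_items(table, "navigator"))  # ['item_3']
```

Items such as item_2 in the example, which predict both criteria, would belong instead to the tests measuring common variance.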


CHAPTER IV 


The Personnel Research Program of the Adjutant 
General’s Office of the United States Army 


E. DONALD SISSON 


The contributions of the Personnel Research Section of the Adjutant 
General’s Office in World War II are reviewed in this chapter. This pro- 
gram was established in 1940 in the Adjutant General’s Office with the 
advice of the Committee on Classification of Military Personnel of the 
National Research Council. This Committee, of which W. V. Bingham was 
chairman, included C. C. Brigham, H. E. Garrett, L. L. Thurstone, L. J. 
O’Rourke, M. W. Richardson, and C. L. Shartle. 

The six main sections of this chapter present the work of the staff of 
the Adjutant General’s Office on selection (I) and classification (II) pro- 
cedures, training (III), the measurement of proficiency (IV), leadership 
(V), and tabulating and analysis technics (VI). Since the contributions 
of this group in the form of numbered pamphlets in the Personnel Research 
Section Report series are anonymous, the individuals who served on this 
staff from 1940 to 1946 are listed in the accompanying footnote.* 


I. Selection 

Induction Standards 


The differentiation of those who could learn to be proficient soldiers in 
a reasonable length of time from those of insufficient learning capacity 
for such service was a preliminary necessity in the utilization of man- 
power in the war effort. In order to distinguish enlistees who could learn 
duties of a soldier in the usual amount of time (Army General Classification 
Test Grade III) from slow learners (Army General Classification Test 
Grade IV), Classification Test R-1 was developed (35) from AGCT-1a 
items with item-grade correlations of .35 to .65. It was standardized (61) 
in June 1941. Critical scores equivalent to AGCT standard scores of ninety 
and one hundred were derived. Another form, R-2, was prepared from 
AGCT-1b in February 1942 (106), and similar critical scores established 
(119). Forms R-3 and R-4 were ready in May 1946 for use with men 
enlisting or reenlisting in the Army, and the relationship of these forms 
to AGCT-3a was studied (310). Placement and achievement tests in reading 
and arithmetic were constructed (254) for each of the four levels of 
training given in Special Training Units (STU) which were set up to 
teach illiterates possessing learning ability. Preliminary research was 
extensive; experimental forms were studied for item content, item-analyzed 
for difficulty, and validated. Standard score scales were constructed for 
these tests (250, 251). 

Literacy Tests. Attempts were made to determine minimum literacy re- 
quirements for acceptance for induction, and to develop measures of 
mental capacity not dependent upon higher literacy levels. Minimum 
literacy tests were constructed early in 1941 to eliminate those unable 
to read at the fourth grade level. Critical scores were determined (56) 
using the Metropolitan Advanced Reading Test, Form A, as the reading 
ability criterion. Minimum Literacy Test (Form 1) scores of engineer 
trainees at Ft. Belvoir in August 1941 were studied in relation to ratings 
of unsatisfactory, satisfactory, and outstanding on fourteen criteria obtained 
from training records (55). A tetrachoric r of .45 between those passing 
and failing the test and the percent above and below the median training 
rating indicated some relationship between success on literacy test and 
success on “job.” A sharp increase in the percent of unsatisfactory ratings 
below fourth-grade level indicated this minimum reading ability was a 
reasonable critical level. Two forms of a verbal measure of general learning 
ability, Qualification Test Q-1 and Qualification Test Q-2, were released 
in June 1943, replacing the “pure literacy” test. Each test contained items 
on paragraph reading, arithmetic computation, and general orientation. 
Critical scores were established by which men were accepted, rejected, 
or assigned to special training. The percents of 3311 men tested in induction 
centers in three service commands scoring in the critical score intervals 
were tabulated (212, 242). The relationships of Q-1 and other tests to 
level of training in Special Training Units were studied (192) in Sep- 
tember 1943. 
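
A tetrachoric r of the kind reported above is estimated from a fourfold table, here pass or fail on the test against above or below the median rating. The sketch below uses the cosine-pi approximation, which is reasonably close only when both variables are split near their medians. It is offered as an assumed illustration in Python; nothing in the report indicates that this was the computing procedure actually used, and the cell counts in the usage example are invented.

```python
from math import cos, pi, sqrt


def tetrachoric_approx(a, b, c, d):
    """Cosine-pi approximation to the tetrachoric r.

    a, d are the concordant cells (pass/high, fail/low);
    b, c are the discordant cells (pass/low, fail/high)."""
    if a * d == 0 and b * c == 0:
        raise ValueError("table is degenerate; r is undefined")
    if b * c == 0:
        return 1.0                      # no discordant cases
    return cos(pi / (1.0 + sqrt((a * d) / (b * c))))


# Usage: invented counts for a fourfold table.
print(round(tetrachoric_approx(30, 10, 10, 30), 4))  # 0.7071
```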

Nonverbal Group Mental Ability Tests. Visual Classification Tests VC-1, 
X-1; VC-1, X-2; VC-1, X-3; and VC-1a were nonlanguage tests constructed 
to select a quota of illiterates with sufficient mental capacity to 
absorb army training. The item types included visual perception, paired 
comparison, and abstraction. Revised forms were developed from item 
analyses of the preceding forms (137, 157, 158, 160). VC-1, X-2 was 
standardized on a population of 764 men containing Negroes and whites 
in a ratio of approximately five to one (138). A lower critical score was 
set to exclude the lowest 2 percent of the Army GCT population, an upper 
critical score corresponding to an AGCT standard score of sixty (Grade IV). 

Individual Tests of Mental Ability. The Wechsler Self-Administering 
Test was found too difficult, with too narrow a range of scores among low- 
grade men. Study and item analysis (112) in March 1942 showed correla- 
tions of .83 with the AGCT for unrestricted range of 1250 men and .23 
for restricted range of 375 low-grade (Grade V on AGCT) men. Its 
validity as a predictor of soldier performance ratings in Special Training 
Units was very low (218). Low correlations of other Army induction tests 
with these ratings (216) suggested the possible inadequacy of the ratings. 

Over-All Studies. In validation studies (test scores with ratings in Special 
Training Units) of groups of tests (209, 210, 211, 213), the tests with some 
verbal component were better than the others in screening the unsatisfac- 
tory STU men. Biserials between test scores and rejection after STU train- 
ing or graduation ranged from .38 to .72 for the tests in use December 
1943 to February 1944 (214). Various combinations of tests gave multiple 
correlations well above .60. Qualification Test Q-1, dependent to some 
extent upon literacy, was the best predictor. Standardization data (215) 
were obtained on three tentative test batteries. The results of the new induc- 
tion program in June 1944 showed higher rejection rates than the old 
programs—approximately a 3.5 percent difference each month (242). 
Also, the educational inferiority of the southern selectee was evidenced 
by comparative rejection rates of Negroes and whites. 

Recruiting Standards for the Women’s Army Corps (WAC). Several 
mental alertness tests were developed for selecting women for the WAC. 
Women’s Classification Test WCT-1, X-2 (first designated Mental Alertness 
Test MAT-1, X-2), used in the selection of both enlisted women and officer 
candidates, was standardized (150) in 1942. The test had a Kuder-Rich- 
ardson reliability of .94 and correlated highly with the AGCT, Otis Group 
Intelligence, Otis Self-Administering, and the ACE tests (150). A revised 
form, WCT-2, used only to select enlistees, had a Kuder-Richardson 
reliability of .97 and correlated .85 with the AGCT (199). It was standardized 
(202) on 12,000 applicants in October 1943. WCT-2 was superior to the 
AGCT in predicting academic grades in WAC officer candidate school, 
but neither of the tests predicted leadership ratings (191). In 1944 a 
short recruiting test, Classification Test R-1, replaced WCT-2. R-1, which 
has also been used for regular Army recruitment (v.s.), had a reliability 
of .94 and correlated .78 with WCT-2 (197). 


Selection for Specialist Training 


A meteorology aptitude test was used by the Air Corps thru the war for 
the selection of weather observer students. This test battery, consisting of 
a mental alertness test of the traditional type and fifty meteorology and 
144 physics true-false items, gave adequate validities and reliabilities (231). 
Aircraft Warning Aptitude Test TC-10A contained a section on locating 
grid points by coordinates and a section on plotting coordinates. The first 
part proved valid against the criterion of theoretical grades in courses, the 
second part against performance grades. Aircraft Warning Classification 
Test TC-11a was given to those who passed the previously mentioned test 
for classification into potential specialist categories. More than 90 percent 
of failures were eliminated (226). Among the specialized tests used in 
small, sometimes unsuccessful, programs to select trainees for highly 
specialized Army courses, were those for Balloon Barrage courses (129), 
Combat Intelligence courses (181, 196), Military Police courses (171), 
and Medical Technicians (162). A battery including the AGCT and 
several mechanical aptitude tests was investigated for use to select Air 
Corps bombardiers and navigators. Paper-and-pencil tests were found 
to be related (73) to academic course grades but not to flight-training 
records. However, reliability of the latter criterion was low. Research in 
this area was subsequently transferred to the Air Surgeon’s Office in 
December 1941. A large-scale comparative study of apparatus and written 
tests was conducted for the purpose of validating and standardizing an 
aptitude testing program for Air Corps basic-training centers. The written 
tests were generally superior to the apparatus tests against the criterion 
of academic success in training courses (227, 228, 229). Tests finally 
chosen for the battery contained only two performance tests out of a large 
number tried: (a) Nut and Bolt Manual Dexterity Test TC-5a, and 
(b) U-Bolt Assembly Test TC-6a. An attempt was made to validate the 
instruments against on-the-job performance as judged by five types of 
supervisory ratings of Air Force ground-crew men in active units. The 
written tests showed lower validity than with the criterion of academic 
success. All validities were much lower than in unselected, untrained 
populations (290, 291). The U-Bolt Assembly Test appeared to be of 
some promise. Tests of informal information in shop mechanics, auto- 
motive and driver information, electricity, and radio, originally intended 
for use together with the latest form of the AGCT to form a comprehensive 
basic classification battery in initial processing, have instead been adopted 
for use at training centers to select men for specialized training. Two forms 
of the Automotive Information Test (AI-1 and 2) and of the Shop 
Mechanics Test (SM-1 and 2) have been standardized on large populations 
(252). Extensive item analyses and validity studies of all four tests have 
also been conducted (228, 236, 280, 281, 282, 283, 284), with adequate 
validities obtained. A preference record and a self-description form based 
on the forced-choice technic were validated against a production index 
and a 3-point rating in a study of selection instruments for personnel 
suitable for recruiting work (289), with disappointing results. 

Radio Code Operators. Investigations by the Personnel Research Section 
and various other agencies have resulted in the authorization for Army-wide 
use of two tests for selecting radio code operator trainees. Many other 
code aptitude tests have been considered. The criteria used for the validities 
reported here include number of hours to reach specified receiving speeds, 
final code speed attained, and the NDRC Code Receiving Tests. Usually, 
several of these were considered in each study (15). The Signal Corps Code 
Aptitude Test (SCCA) evolved from a test tried out by the Signal Corps 
between 1924 and 1931. By 1941 the SCCA was widely used by several of 
the Arms and Services. Usually administered by phonographic transcrip- 
tions, the SCCA was a code discrimination type test containing seventy-eight 
pairs of patterns to be identified as “same” or “different.” Reliability esti- 
mates by Kuder-Richardson Formula No. 21 were much lower than 
desirable, ranging from .67 to .78, except for one sample for which .88 was 
reported (60, 91, 140, 184, 195). Validity data varied considerably from 
one sample to another, with coefficients from —.03 to .57 (18, 60, 83, 92, 
139). Data reported by the Signal Corps for testing between the wars gave 
validities ranging from .54 to .75 (105). Little improvement in reliability 
or validity resulted from doubling the SCCA to make the Radiotelegraph 
Operator Aptitude Test ROA-1, X-1, which was authorized for Army-wide 
reception center use in July 1942. The Kuder-Richardson reliability was 
.87 and reliabilities by the split-half method ranged from .73 to .82 (146, 
184, 195). Validity of ROA-1, X-1 was only fair, usually around .30 (161, 
186, 195, 224). The test was standardized on the basis of SCCA results, 
standard scores being set to a mean of 100 and a sigma of 20 (102). 
Studies indicated that previous musical instrument experience as well as 
code experience were positively related to radio code test scores and 
added to success in radio code training (60, 184). Data from numerous 
radio operator specialist schools indicate that fewer failures result if men 
are preselected on ROA-1, X-1 plus AGCT rather than AGCT alone (224). 
Army Radio Code Aptitude Test, ARC-1, a code learning test developed by 
the National Defense Research Committee (18), took the place of ROA-1, 
X-1 toward the end of 1944. The test required the recognition of three 
learned Morse Code letters when presented with unlearned characters. 
Validities for ARC-1, usually between .50 and .60, were higher than those 
for ROA-1, X-1 or the Thurstone Code Aptitude Test (224). A check on 
the standardization sample resulted in the raising of raw score equivalents 
for the various standard scores (248). A series of Code Learning Tests, 
work-sample tests based on the same principle as ARC-1, showed consider- 
able promise but were never carried to the completion stage. Reliabilities 
by Kuder-Richardson Formula No. 21 and by estimation from odd-even 
correlations ranging from .94 to .98 have resulted for various editions of the 
test (65, 91, 128, 140, 161). Validities have, in general, been as good as for 
either the ROA-1, X-1 or ARC-1 (65, 140). A paper-and-pencil 
alphabet-symbol Substitution Test-1, X-1, also developed by the Personnel 
Research Section, gave reliabilities over .95 but was somewhat less valid 
than ROA-1, 
X-1 (128, 140, 161). A revised edition of a Code Rhythm Test developed 
by Thurstone has also shown some promise (105, 128). The Thurstone 
Code Aptitude Test was tried out in studies on ARC-1 and a revision 
of this test, designated ROA-2, X-1, was accomplished (205). Both tests 
were highly reliable but the original Thurstone test was more valid. 
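
The limited gain from doubling the SCCA to form ROA-1, X-1 can be set 
against what the Spearman-Brown formula would predict for a lengthened 
test. The sketch below applies that standard formula to the SCCA 
reliability range quoted above. It assumes that the added items were 
parallel to the originals, an assumption the source does not state, and the 
function name is illustrative. 

```python
def spearman_brown(r, k=2.0):
    """Predicted reliability when a test of reliability r is made k times as long."""
    if not 0.0 <= r <= 1.0 or k <= 0:
        raise ValueError("need 0 <= r <= 1 and k > 0")
    return k * r / (1.0 + (k - 1.0) * r)


# SCCA reliabilities of .67 and .78, doubled in length (k = 2).
for r in (0.67, 0.78):
    print(r, round(spearman_brown(r), 2))
```

With k = 2 the same formula is the usual step-up applied to odd-even 
correlations, so it also underlies the split-half estimates mentioned in 
this section. 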

Truck Drivers. Research resulted in the standardization and validation 
of a group of tests (16), including a Driver Experience Inventory, a Driver 
Information Test, tests of visual acuity and night vision, and a reaction- 
time test. Other well-known psychophysical tests were assessed as predictors 
of driving ability. The most frequent criterion was a road test, consisting of 
fifteen to twenty minutes' observation of the driver in the standardized 
situation. Specific tests were checked on a Road Test Check List. A score 
consisting of checks of correct or incorrect operations, weighted or 
unweighted, was obtained, plus an over-all rating. Reliability coefficients for 
the road test were not as high as those usually obtained for objective tests, 
but equalled those usually obtained for criteria in validity studies (24, 
64, 72, 86, 126). Two forms of a Driver Information Test (DIT) were 
standardized (172, 176). Trial of personal history items showed driving 
experience items to be most valid (25, 72, 123, 134, 152). A Driver Experi- 
ence Inventory showed variable validity, fairly high in certain populations 
(120, 147, 172). 

Visual Acuity. Several of the more familiar visual acuity tests gave 
consistently low correlations with the road test (25, 72, 126). Tests of night 
vision gave higher validities against a special night road test (71, 76, 
131). Studies of race differences in night vision (25, 78, 121) produced 
no consistent or significant results. High sugar intake showed no effect on 
night vision (72). Studies are currently in progress on the standardiza- 
tion of new tests of visual acuity and night vision. 

Sensori-Motor Tests. Data show low positive and zero correlations and 
some inconsistency from sample to sample in studies of several sensori- 
motor tests as predictors of ratings of driving ability. However, popula- 
tions were often small and criteria not always reliable. In addition, soldiers 
are already a physically selected population (25, 71, 126). 


Trade Knowledge Tests 


Numerous editions of tests in electricity, radio, and automotive mechanics 
have been developed to aid in the identification of those with interests and 
aptitudes in these fields, as evidenced by possession of informal informa- 
tion. A General Electrical and Radio Information Test was constructed 
after item analysis of experimental forms (100). Subsequently, separate 
series in electricity and in radio were established. Limited analysis revealed 
some fair validities (227, 239). A General Automotive Information Test 
yielded a correlation of .67 with course grades of 147 men (110). The test 
was further expanded and item analyses and additional validity studies 
were carried out (122, 142, 169, 183, 239). 


West Point Qualifying Examinations 


Each year since 1942 a new form of West Point Qualifying Examination 
(WPQ) has been constructed for administration along with the regular 
West Point examinations. It is intended that this battery eventually re- 
place present West Point entrance examinations. The latest form of the 
WPQ contains two subtests, Language Aptitude (learning an artificial 
language), and Elementary Mathematics (the use of short-cuts in arith- 
metical and algebraic processes). Each year’s series was tried experi- 
mentally and prevalidated on classes already selected and attending the 
Academy and administered in final form to applicants the following year. 
Additional validity data were gathered subsequently, with academic success 
as criterion (164, 174, 182, 190, 219, 220). 


Selection of Warrant Officers 


Objective examinations have been developed in over thirty administra- 
tive and technical military specialties for the selection of warrant officers. 
The subjects range from Auditing and Accounting to Weather and Cryp- 
tography. No reliability or validity data have been gathered on these tests, 
altho they were constructed with the aid of technical experts and have 
been widely used (237, 275, 276). 


Personality Studies 


In 1942, Personnel Form P-1, also known as the Shipley Personality In- 
ventory, was proposed as a group test for military use in differentiating 
troublemakers, neurotics, and normals. Extreme troublemakers and ex- 
treme neurotics were identified by this instrument (149), but reliability 
(136) was very low. More intensive research in personality measurement 
by the Personnel Research Section was begun toward the end of 1943. 
The Minnesota Multiphasic Personality Inventory, adapted and revised 
for Army use, the Cornell Selectee Index, the Army Individual Test (See 
Chapter II), and the Biographical Information Blank (See Chapter VI) 
were among the instruments studied (61). The Multiphasic Personality 
Inventory, a paper-and-pencil objective test, scored separately for each of 
nine psychiatric classifications, showed promise; its items were analyzed 
and some selected (294) for inclusion in the Biographical Information 
Blank used in the regular army officer retention program (See Chapter 
II). Validity studies of the Multiphasic Inventory were made in 
connection with predicting AWOL’s and psychiatric referrals among basic trainees, 
predicting psychiatric rejects among WAC applicants, selecting trainees for 
Arctic duty, and validating against careful psychiatric diagnoses. Results 
have been favorable enough to justify further study and development of 
the inventory. An item analysis of responses of “good” and “bad” WAC 
applicants did not show significant differences (314). Correlational analyses 
of the Army Individual Test, composed of six separate subtests and 
administered to a population of psychoneurotics and psychotics, suggested 
the validity of the AIT for differentiating between psychoneurotics and 
specific psychiatric diagnostic categories, particularly when products and 
squares of test scores, rather than sums and differences, were used (313). 
The Army Wechsler subtest scores were found to add little to predictions. 


II. Classification 

The Army General Classification Test 


The Army General Classification Test (3, 7), providing an index of the 
learning ability of recruits to facilitate classification for training and job 
assignment, was first released as AGCT-la in October 1940. Subsequent 
forms, including two Spanish versions (258), were issued during the war: 
AGCT-1b in April 1941; 1c and 1d in October 1941. These consisted of 140 
to 150 multiple-choice items on vocabulary, arithmetic, and block counting. 
Raw scores were converted to standard scores with a mean of 100 and a 
standard deviation of 20. Standard scores were divided into five Army 
Grades. A revised series, in which part scores were recorded for the first 
time, appeared as AGCT-3a in April 1945 and AGCT-3b in 1946 (236, 278). 
The AGCT-3 series contained four tests: reading and vocabulary (189), 
arithmetic computation, arithmetic reasoning (235), and pattern analysis. 
The total score was the equivalent of the AGCT-1 score while part scores 
were also used in classification. An information battery, originally intended 
for inclusion in the AGCT-3, was used instead for classification purposes 
at training centers. Four forms of each type of subtest in the AGCT-3 were 
developed and equated for content and difficulty. 
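
The scale described above, with a mean of 100 and a standard deviation of 
20 divided into five Army Grades, can be illustrated by a short sketch. A 
linear conversion is shown for simplicity, whereas the Army's conversion 
tables were derived empirically. The grade boundaries used here (130, 110, 
90, and 60) are an assumption, since the text does not list them, and the 
raw scores are hypothetical. 

```python
from statistics import mean, pstdev

# Assumed lower bounds of Army Grades I-IV; not given in the text.
GRADE_FLOORS = (("I", 130), ("II", 110), ("III", 90), ("IV", 60))


def standard_score(raw, raw_mean, raw_sd, target_mean=100.0, target_sd=20.0):
    """Linear conversion of a raw score to the mean-100, SD-20 scale."""
    return target_mean + target_sd * (raw - raw_mean) / raw_sd


def army_grade(std_score):
    """Return the Army Grade (I-V) for a standard score."""
    for grade, floor in GRADE_FLOORS:
        if std_score >= floor:
            return grade
    return "V"


raws = [35, 60, 75, 90, 120]  # hypothetical raw scores
m, sd = mean(raws), pstdev(raws)
for raw in raws:
    s = standard_score(raw, m, sd)
    print(raw, round(s), army_grade(s))
```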

Standardization. Standardization of Form 1a (31) was accomplished, 
before the first inductees under the Selective Service Act entered the Army, 
on a population of regular Army and CCC men equated to the expected 
Army population by weighting on age, education, and area of residence. 
Because of such factors as race, occupational deferments, illiteracy, and 
direct commissions, the distribution curve of the actual Army population 
varied from the expected. Despite 
this variance, the conversion table for Form 1b (40) was computed by 
combination regression of 1a and 1b scores, because the norms for 1a were 
already widely used for classification. Standardization of Forms 1c and 
1d (42) was based on 1a and 1b. Distributions on Forms 1c and 1d had 
less negative skewness, and the conversion tables were set up to yield 
Army grade percent midway between the old and the new forms. Improved 
discrimination of 1c and 1d was partly due to more equitable distribution 
of item difficulty. Form 3a was standardized (236) on a population of 
39,000, carefully stratified and weighted by age, education, race, and 
geographical location. 

Item Analyses. Studies of response frequencies, item difficulty, dis- 
criminating power, and item-test consistency (29, 30, 35, 41, 115) were 
used for guidance in construction of alternate forms. It was found that 
equal scores might represent widely differing performances in type of 
questions answered (143, 117). Most extensive item analyses were made 
on the four trial forms for the AGCT-3 (236). Final form items were care- 
fully graduated in difficulty and selected on the basis of item-total test 
correlation. 

Practice Effect. Study of practice effect on 1a and 1b scores (39) and 1c 
and 1d scores (42) showed small but consistent increases regardless of 
which form was taken first. Retesting after considerable lapse of time for 
Grade V men (173) and men in OCS (163) showed similar results, which 
were attributable to factors other than the effects of Army training. 

Part Scores. Altho part scores of the AGCT-1 were not used for 
classification, investigation was made of relative contributions, discriminative 
power, intercorrelation, reliability of parts, and correlations with part and 
total scores of other forms (36, 38, 114). Each part was found to make a 
significant contribution. Combined vocabulary and arithmetic scores of 
one form were found as good as total scores for predicting total scores on 
a second form. 

Reliability. Repeated reliability estimates on all forms by Kuder-Richard- 
son Formula No. 21 (31, 40, 42, 236), odd-even comparisons (31), retest 
(75), alternate forms (38, 42, 236), and Kuder-Richardson Formula No. 
2 (236) placed the reliability generally above .90. 
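
Kuder-Richardson Formula No. 21, cited thruout this chapter, needs only 
the number of items, the mean, and the variance of total scores. A minimal 
sketch follows. The figures in the example are hypothetical and are not 
AGCT data, and the function name is illustrative. 

```python
def kr21(n_items, mean_score, variance):
    """Kuder-Richardson Formula 21 reliability estimate.

    r = k / (k - 1) * (1 - M * (k - M) / (k * s^2)), with k items,
    mean total score M, and total-score variance s^2.
    """
    if n_items < 2 or variance <= 0:
        raise ValueError("need at least two items and a positive variance")
    k, m = n_items, mean_score
    return k / (k - 1) * (1.0 - m * (k - m) / (k * variance))


# Hypothetical 150-item test: mean 75, variance 500.
print(round(kr21(150, 75.0, 500.0), 2))
```

The formula treats all items as equally difficult; where difficulties 
differ it gives a lower figure than Formula No. 20, so it is usually read 
as a conservative estimate. 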

Validation. Several hundred validity coefficients attest to the value of 
the AGCT in selecting men for a large number of Army specialist courses 
(27, 37, 57, 68, 73, 77, 89, 92, 94, 97, 99, 108, 113, 129, 132, 174, 175, 
176, 178, 201, 213, 223, 226, 277, 324, 336, 338). Most of the populations 
were preselected either on the AGCT or on some highly correlated factor. 
The criterion was usually academic grades. Where preselection was rigorous, 
correlations were lower. Validities for criteria involving personality, e.g., 
success in Officer Candidate School (99, 132, 175, 198), or formal academic 
background, e.g., success in the Army Specialized Training Program (324, 
336, 338), are low. AGCT-3a (227) was generally superior to AGCT-1. 
The reading and vocabulary subtest correlates highest with written ex- 
aminations, and pattern analysis is usually the best predictor of practical 
performance. Use of part or combined subscores in classification is ques- 
tionable because of high subscore intercorrelations. 
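
The lowering of correlations by rigorous preselection, noted above, can be 
illustrated with the standard formula for restriction of range. The sketch 
assumes direct selection on the test itself, with linear regression and 
equal scatter about the regression line. Selection on "some highly 
correlated factor" would follow a different formula. The starting 
correlation of .60 and the function name are hypothetical. 

```python
from math import sqrt


def restricted_r(r, u):
    """Correlation expected after direct selection on the predictor.

    r : correlation in the unrestricted group
    u : standard deviation of the predictor in the selected group
        divided by that in the unrestricted group. u < 1 models
        restriction; u > 1 corrects a restricted r upward.
    """
    if not -1.0 <= r <= 1.0 or u <= 0:
        raise ValueError("need -1 <= r <= 1 and u > 0")
    return r * u / sqrt(1.0 - r * r + r * r * u * u)


# The narrower the selected range, the lower the observed validity.
for u in (1.0, 0.75, 0.5, 0.25):
    print(u, round(restricted_r(0.60, u), 2))
```

Applying the same function with the reciprocal ratio recovers the 
unrestricted value, which is how a validity obtained on a preselected 
group is ordinarily corrected. 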

Relationship with Other Variables. Studies show high correlation with 
education (31, 118, 127, 136, 270) and with other well-known tests of 
mental ability (32, 34, 103, 104, 165, 257, 331), decreasing with 
restrictive preselection, but no significant relationship with age (31, 118, 
236), except that in highly selected groups, correlations tended to be slightly 
negative. Comparisons of male and female Army populations in age, cul- 
tural and educational background, selection methods, and geographical 
distribution, had inconclusive results (236). Comparisons of Negroes and 
whites (117, 270), complicated by social, cultural, and educational 
differences, showed lower mean scores for Negroes, the difference decreasing 
where educational status was matched. Mean scores for northern soldiers 
of both races are higher than for southern soldiers of the corresponding 
race. An early tentative study of relationship to civilian and military 
occupations (26) was made. Later studies showed a definite occupational 
hierarchy and sectional differences within occupations (270), despite con- 
siderable overlap, even between highest and lowest ranks; but no relation- 
ship to age or experience was found. Variability of scores was higher in 
lower level occupations. Occupations with restricted score ranges probably 
depend on abilities measured by AGCT; others with wider ranges depend 
more on specific interests or aptitudes. For counseling purposes a low score 
was considered possible ground for avoiding a high level occupation, but 
a high score per se is no ground for avoiding any occupation. 

Special General Classification Tests. A special Non-Language Test 2abc 
to test illiterates and Grade V men was standardized (59) on a population 
with a normal distribution of AGCT scores. An Army Individual Test 
(AIT) of general mental ability (16, 17) consisting of three verbal and 
three performance tests (221, 222) was standardized (230) on a group of 
1000 native-born literate whites. A study on a small population (222) 
indicated that the test could discriminate between Grade V men in Special 
Training Units who were likely to succeed or likely to fail in Army 
training. 


The Mechanical Aptitude Test (MAT) 


The general Mechanical Aptitude Test MA-1 appeared in February 1941. 
Forms MA-2 and MA-3 were released in October 1941. A later form, MA-4, 
X-1, was built for WACs. MA-1 consisted of items on mechanical 
movements (54), surface development, and shop mathematics. MA-2 and MA-3 
differed considerably from MA-1, containing mechanical information (23, 
53), mechanical comprehension (51, 50), and surface development; of 
which the first two were found to be good predictors for mechanics courses 
(58). MA-4, X-1 contained items on tool recognition, mechanical compre- 
hension, and surface development. Use of the MAT at reception centers, 
where scores were recorded for all except illiterates and Grade V men 
on the AGCT, continued until April 1945. It was supplanted by the 
AGCT-3a, which contains a surface development section similar to the 
MAT. Thereafter the MAT was used whenever deemed advisable at train- 
ing centers. 

Standardization. MA-1 was standardized on 3452 men (47). Standard 
scores, with a mean of 100 and a sigma of 20, were calculated by 
equivalent percentiles yielding a breakdown in five Army grades which 
approximated a normal distribution. MA-2 and -3, based on item analysis of trial 
forms (70, 82), were standardized (90) by equivalent percentiles for MA-1 
scores on a population of 2766 men. MA-4, X-1 was standardized by linear 
transformation (154), and was item analyzed (180). 
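
Standard scores "calculated by equivalent percentiles" can be read as 
normalized scores: each raw score receives the standard score that holds 
the same percentile rank in a normal distribution with mean 100 and sigma 
20. The sketch below follows that reading, which is an assumption about 
the procedure; the midpoint percentile-rank convention, the function name, 
and the sample data are likewise illustrative. 

```python
from statistics import NormalDist


def normalized_standard_scores(raw_scores, mean=100.0, sigma=20.0):
    """Map each distinct raw score to a normalized standard score.

    The percentile rank of a raw score is the proportion of cases below
    it plus half the proportion at it, so it always lies strictly
    between 0 and 1.
    """
    n = len(raw_scores)
    target = NormalDist(mu=mean, sigma=sigma)
    table = {}
    for value in sorted(set(raw_scores)):
        below = sum(1 for r in raw_scores if r < value)
        at = sum(1 for r in raw_scores if r == value)
        rank = (below + 0.5 * at) / n
        table[value] = round(target.inv_cdf(rank))
    return table


# Hypothetical, negatively skewed raw scores.
sample = [10, 22, 30, 34, 36, 38, 39, 40, 40, 41]
for raw, std in normalized_standard_scores(sample).items():
    print(raw, std)
```

A linear transformation, as used for MA-4, X-1, keeps the shape of the raw 
distribution, whereas this method forces the standard scores toward 
normality. 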

Reliability. Estimates by the Kuder-Richardson Formula No. 21 (46, 47, 
90, 154), test-retest method (49, 163), and equivalent forms method (90) 
show satisfactory reliabilities for both total scores and subtests. 

Validity. Validity studies gathered a wide range of correlations, usually 
lowered by preselection, with course grades and other criteria. As a verbal 
test, the MAT correlates best with theoretical course grades (66, 227) and 
motor mechanics (48, 49, 89, 97, 103, 125); less well with driver 
performance ratings (156, 172); and negligibly with radio code receiving speed 
(84, 201, 205). Varying results were obtained for clerks (94), aircraft 
warning operators (226), airplane mechanics (194, 201, 228, 229), basic 
trainees (68), and Navy trainees (52, 129). Validity of MA-4, X-1 with 
WAC specialist school grades as criteria (186) was superior to the AGCT 
for motor transport, but inferior for radio repair school. A study of MA-4, 
X-1 for civilian armament trainees (239) found that its validity would be 
improved if the Surface Development Subtest were omitted. 

Relationship to Subtests and Other Tests. MA-2 and MA-3 were found 
to be superior to MA-1 in being less highly correlated with the AGCT (47, 
69, 87, 90, 95, 96, 226). Intercorrelations of total scores and subtests were 
high (47). Correlations were computed for the MAT and civilian 
mechanical aptitude tests (49, 52, 226, 227). 

Tests of Mechanical Aptitude for Civilians. A provisional battery, 
Mechanical Aptitude Test MA-5, was not as valid as standard civilian 
mechanical tests (250). A mechanical aptitude battery consisting of Learn- 
ing Ability Test LA-5 (an Air Corps test of mental ability), Tool Usage, 
Mechanical Problems, and Paper Form Board Test CG-106a did distinguish 
mechanics from nonmechanical workers (260). General Mechanical Apti- 
tude Test CM-142a yielded fair correlations with both final mechanical 
grades and supervisors’ ratings (263). A revision of this test, made shortly 
after V-J day, was designated CM-142ar (264). 


Clerical Aptitude 


Clerical Aptitude Test CA-1, completed in 1940, consists of 280 items on 
name checking, coding, catalog numbers, verbal reasoning, number 
checking, and vocabulary. It was standardized (20) on a group normally 
distributed by AGCT scores. Resulting distribution (63) was leptokurtic 
for both the standardization sample and field returns. Reliability by Kuder- 
Richardson Formula No. 21 was .95 (20), by test-retest method .72 (163). 
Validity coefficients are usually based on small populations, the criterion 
being clerical school grades (19, 47, 69, 94, 96). Since it is highly 
correlated with the AGCT (21, 47, 69, 94, 96) and sometimes inferior as a 
predictor of clerical grades to the AGCT, the Mechanical Aptitude Test 
(94), or the Wells Revision of the Army Alpha (19), its usefulness has 
been questioned. Experimental material for an uncompleted alternate form 
of CA-1 was used in constructing Clerical Aptitude Test CA-2, X-2 for the 
WAC, which covered classifying, cataloging, number and name checking, 
alphabetizing, and spelling. In the standardization the score distribution 
departed markedly from the normal, and the percentile method was used 
to set up standard score scales. The Kuder-Richardson reliability was .97 
(154, 179). Validity studies (186) with grades in specialist courses as 
criteria showed CA-2, X-2 to be inferior to the AGCT for administrative 
specialists. More widely used than any of the above clerical aptitude tests 
were those developed for civilian employees of the War Department, which 
in chronological order of use were the General Proficiency Test WCT-1, 
X-3 (151, 155); CA-2, X-2 (179, 185); a provisional battery, Clerical 
Aptitude Test CA-3 (261); and finally General Clerical Abilities Test 
CC-105a (241, 247, 262, 292). Part A of CA-3 correlated .32 with super- 
visors’ ratings in some jobs (233, 240). Parts of CC-105a had an average 
correlation of .35 with Civil Service CAF grade and supervisors’ ratings, 
but in a number of cases reached correlations around .50 (259, 293, 317). 


Army Trade Screening Tests 


To verify skill status in Military Occupational Specialties a series of 
Army Trade Screening Tests and Experience Check Lists in clerical, 
mechanical, and other technical fields was developed (10, 286, 287, 288). 
Reliabilities, estimated by Kuder-Richardson Formula No. 4 for eight of 
the tests, ranged from .87 to .93. Critical scores were set for most of the 
tests to represent the level of technical achievement attained by graduates 
of the corresponding Army specialist course. Critical ratios between ex- 
perienced and inexperienced men were high. Critical ratios dropped when 
examinees were encouraged to guess. 
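
The critical ratios mentioned above are ordinarily computed as the 
difference between two group means divided by the standard error of that 
difference. The source does not give the formula used, so the sketch below 
is an assumption about the computation, and the figures for experienced 
and inexperienced men are hypothetical. 

```python
from math import sqrt


def critical_ratio(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Difference between two means divided by its standard error."""
    if n_a < 1 or n_b < 1:
        raise ValueError("both groups must contain at least one case")
    se_diff = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / se_diff


# Hypothetical: 100 experienced men (mean 42, SD 8) against
# 100 inexperienced men (mean 30, SD 9).
print(round(critical_ratio(42.0, 8.0, 100, 30.0, 9.0, 100), 1))
```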


III. Training 

Measurement of Academic Knowledge 


Measures of educational achievement in the armed forces gained a new 
importance with the inauguration of the Army Specialized Training Pro- 
gram. Tests in academic subjects such as Algebra, English, etc. were con- 
structed for the Air Forces (67, 87, 93), Corps of Engineers (68, 116), 
Coast Artillery (142), and the WAC (170). These tests were used as early 
selection devices until instruments of better validity were developed. A 
study was also conducted on the difficulty level and usefulness of a General 
Educational Test for Warrant Officer candidates constructed by the Co- 
operative Test Service (249). 

The Army Specialized Training Program: Selection Tests. At the 
inception of the Army Specialized Training Program (8) a test constructed 
for officer candidate selection, the OCT-2, X-3, was found useful as a gen- 
eral ability test (318, 319). It was a considerably better predictor of suc- 
cess in basic engineering courses than the AGCT (324, 334). For the Army 
Specialized Training Reserve Program, three Army-Navy Qualifying Ex- 
aminations were constructed by the College Entrance Board for screening 
applicants (320, 323, 331). A qualifying test (C-4), composed of mathe- 
matics and vocabulary items, was prepared by the Personnel Research Sec- 
tion for the same purpose, and was found to discriminate satisfactorily 
(341) among applicants. A Mathematics Inventory Test, from which the 
mathematics section of C-4 was derived, was used for placement in appro- 
priate curriculums, at the proper level of difficulty, and proved to be a good 
predictor of success in the ASTRP (340). A series of aptitude tests for 
professional medical training was also built (328, 329) and their relation- 
ship to AGCT (325, 326) and other educational factors (330) was studied. 
Certain achievement tests in mathematics and physics were used as selec- 
tion devices for some advanced courses. 

Achievement Tests. More than 150 different national achievement tests 
covering at least eight subjectmatter fields (6) were administered in all 
basic and advanced courses as a check on uniformity of content and ade- 
quacy of instruction given in approximately 200 different training units. 
These included tests for seven different foreign languages. A series of 
studies recorded reliabilities for the tests (6). Attempts were made to 
develop valid norms for test scores (322, 332). Item analyses on prelimi- 
nary forms (327, 337) aided in constructing more reliable forms of the 
tests. The validity of the tests as predictors of success in basic engineering 
was investigated (321). Studies of correlations between R and R-44W 
showed that guessing had little effect on test reliability (333, 335). A socio- 
economic study (339) determined that 30 percent of the trainees were 
receiving more education than their prewar plans contemplated. One result 
of the national achievement testing program was that many instructors 
who originally opposed objective testing in college courses came to accept 
its value. 


Military Training 


A Military Knowledge Test consisting of multiple-choice items and or- 
ganized in pictorial form thruout was developed to test the basic military 
knowledge required of all soldiers. This test evolved from an item analysis 
of several experimental forms. It was used as a device to determine whether 
men being redeployed needed refresher training. The test distinguished 
trained and untrained infantrymen; however, validities were around .30 
against the Soldier Performance Scale (See Chapter IV). 

Army Automotive Screening Battery. An Experience Check List, an 
Apprentice Mechanics Test, a Tool Usage Film Strip Test, and Distributor 
and Valves Assembly and Use of Tools performance tests, administered by 
the “successive hurdle” method, have been used with great success to 
screen army automotive students who could bypass elementary phases of 
training. Based on trial of many experimental forms and methods of scor- 
ing, the battery showed high validity (255). A subsequent follow-up study 
showed that students selected to skip beginning phases of training on the 
basis of these tests completed the course even more successfully than stu- 
dents taking the entire course (256). 
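
The “successive hurdle” procedure can be sketched in a few lines of Python. This is an illustration only: the hurdle names follow the battery described above, but the cutoff values, the score scales, and the function names are assumptions and are not taken from the reports cited.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical hurdles in order of administration: (test name, minimum score).
# The cutoffs are invented for illustration; the source does not report them.
HURDLES: List[Tuple[str, float]] = [
    ("experience_check_list", 20.0),
    ("apprentice_mechanics_test", 35.0),
    ("tool_usage_film_strip_test", 15.0),
    ("performance_tests", 60.0),
]


def first_failed_hurdle(scores: Dict[str, float]) -> Optional[str]:
    """Return the name of the first hurdle failed, or None if all are cleared.

    Checking stops at the first failure, so the later (costlier) tests
    matter only for candidates who cleared every earlier hurdle.
    """
    for name, cutoff in HURDLES:
        # A missing score is treated as "not cleared".
        if scores.get(name, float("-inf")) < cutoff:
            return name
    return None


def may_skip_elementary_phase(scores: Dict[str, float]) -> bool:
    """A student bypasses elementary training only after clearing all hurdles."""
    return first_failed_hurdle(scores) is None
```

The point of the ordering is economy: the inexpensive paper instruments come first, and the performance tests are needed only for those who survive them.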


IV. Measures of Proficiency and Criteria 
Truck Driving and Machine Shop Performance 


Performance criteria for truck driver trainees consisted of a practical 
road test checklist with objective ratings on specific items and a general 
driving proficiency rating, usually on a 5-point scale. The reliabilities of 
road test ratings are as high as those usually available for practical per- 
formance criteria. An early attempt at extreme objectification (64) was 
abandoned because of poor results. In one study the biserial correlation 
between number of unsatisfactory items on a checklist and general ratings 
was .83 for 1982 men and .28 for 1454 men rated under somewhat differ- 
ent conditions (72). Weighted checklist scores had tetrachoric correla- 
tions between .51 and .82 with general ratings for a sample of 1717 men 
(86). Other reliabilities are recorded using the split-half method (126) 
and test-retest method (24) on checklist scores. The general conclusion 
was drawn that reliability can be increased by training the examiners. 

Three raters were used to rate examinees on performance on a list of 
common machine shop operations. Estimated agreement among raters was 
fair. Average reliability of all three raters was .80 (107). 


Soldier Performance Report 


Major use of the Soldier Performance Report was as a criterion for pre- 
dictor tests such as the Army General Classification Test and the Army 
Individual Test in an effort to screen potential satisfactory soldiers from 
poor risks before basic training. Two early experiments (98, 203) were 
reported. Another study, validating a group of induction station tests (208), 
used a scale restricted to marginally satisfactory and unsatisfactory ratings. 
Contingency coefficients of reliability (corrected) ranged between .64 and 
.78. Somewhat lower coefficients were obtained for a much more restricted 
group on AGCT scores (211). In the validation of induction station tests 
an 8-point scale and a composite checklist were used (210). A soldier 
performance scale (279) used to validate a military knowledge test, in which a 
superior noncommissioned officer rated the soldier on a 5-point description 
scale, showed satisfactory reliability (above .80) in rating and rerating 
comparisons and in correlations between ratings by platoon leaders and 
platoon sergeants. 


AAF Technical School Success 


Attempts were made to find criteria more reliable than academic course 
grades for predictors of success in AAF schools. Paired comparisons (229) 
showed high reliability for small groups of ratees. Closest approximation 
to on-the-job conditions was tried (290) on technicians in AAF combat 
units in the ZI. Five types of on-the-job rating were secured: rank in over-all 
job ability, paired comparisons, and a five-step scale on performance, 
personality, and over-all worth. Odd-even reliabilities were high except 
for personality. Intercorrelations were about .90. 


War Department Civilian Employees 


Dissatisfaction with the reliability of criteria in use for test predictors 
in clerical work led to some experimental work on supervisory ratings. 
Two ability rating scales and a trait scale were constructed (293). They 
showed less correlation with predictor test grades than did civil service 
grade. 


Officer Efficiency 


A criterion originally developed to validate devices for measuring lead- 
ership and personality fitness among officers became the backbone of sev- 
eral programs of officer selection, retention, and efficiency reporting. The 
adequacy of the criterion depends on the agreement of groups of officers 
intimately acquainted with the character and proficiency of given officers 
as to their placement in widely separated positions along a continuum 
of over-all competence (343). It was determined that a random group 
of ten rating officers can distinguish the over-all competence of officers 
almost as well as can a designated group of ten selected raters. In order 
that assignment to any criterion group be reliable, the officer being rated 
should be known well enough to be rated by at least seven out of ten 
raters. This procedure was perfected in the “Buddy Rating System” (295, 
297) in an experiment with Officer Candidate School classes. Pooled inde- 
pendent ratings by a group of “buddies,” when checked against ratings 
by the platoon officer, yielded a highly reliable criterion against which 
to measure selection instruments. The corrected split-half reliability varied 
between .81 and .91, and correlations between buddy and platoon officer 
ratings ranged from .51 to .59. Greater reliability (295) was obtained 
as length of acquaintanceship increased. A system was devised for assign- 
ing a criterion index score of from 0 to 60 to include, in addition to definite 
criterion groups of high, middle, and low competence, those men of more 
indeterminate status (298). A variant of the original pooled rating sys- 
tem, comparison of the officer with Army officers in general and with offi- 
cers of the same grade on a 20-point scale, was later developed and showed 
high correlation with the criterion index (266). 
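
A corrected split-half reliability for pooled ratings can be illustrated as follows. The sketch assumes that the “correction” is the Spearman-Brown step-up and that raters are split into odd- and even-numbered halves; the source names neither detail, and the function names are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half: float) -> float:
    """Step a half-pool correlation up to the full pool: 2r / (1 + r)."""
    return 2.0 * r_half / (1.0 + r_half)


def corrected_split_half(ratings: List[List[float]]) -> float:
    """ratings[i][j] is the rating given to ratee i by rater j.

    Each ratee receives a mean rating from the odd-numbered raters and
    another from the even-numbered raters; the correlation between the
    two half-pools is then stepped up with the Spearman-Brown formula.
    """
    odd = [sum(row[0::2]) / len(row[0::2]) for row in ratings]
    even = [sum(row[1::2]) / len(row[1::2]) for row in ratings]
    return spearman_brown(pearson_r(odd, even))
```

With two raters and four ratees rated `[[1, 2], [2, 1], [3, 4], [4, 3]]`, the half-pool correlation is .60 and the stepped-up value is .75.
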


V. Studies of Leadership 
Officer Selection 


Tests, rating forms, officer evaluation reports, interviewing procedures, 
and other devices were developed and investigated for measuring back- 
ground, learning ability, and leadership qualities of officers and officer 
candidates. In 1941 two forms of the Higher Examination, H-1 and H-2, 
containing the most difficult vocabulary and arithmetic items, were con- 
structed. These forms were intended for more exact discrimination among 
candidates in Army Grades I and II on the AGCT. Tables of equivalent 
scores with the AGCT were prepared (74). Both forms correlate highly 
with each other and with the AGCT. Reliability coefficients are high. Un- 
due emphasis on speed caused examinations to be discontinued for officer 
candidate selection because the speed factor appeared to discriminate 
against the older men. Form H-1 correlated .48 with final grades for sixty- 
seven engineer officer candidates (79, 85, 88). War Orientation Test, 
WOT-1, X-1, containing 100 five-alternative items on information about 
current events, had high reliability and gave significant differences in means 
between officer candidates and basic trainees, but had lower validity than 
the AGCT as a predictor of success at Officer Candidate School (153). 

Officer Candidate Tests, OCT-1 and OCT-2. Experimental forms con- 
tained items on comprehension of paragraphs and graphic material and 
on arithmetic reasoning, which were chosen from the Army Officers Train- 
ing Examination, a battery developed for the War Department by the Co- 
operative Test Service. Reliabilities for the first experimental form were 
not satisfactory but higher correlations with OCS grades were obtained 
than for the AGCT (148). Conversion tables to AGCT scores were pre- 
pared and an item analysis made (118, 167). The test was rejected because 
informational content was taken from commonly used War Department 
manuals. Two final forms, OCT-1 and OCT-2, were constructed after item 
analysis of two new experimental forms (168) and standardized on 2000 
men (175). Reliabilities were .81 and .91 respectively by the Kuder- 
Richardson Formula No. 21. Both forms correlated highly with the AGCT 
and with years of education in an unselected population. Validity coeffi- 
cients were high, both tests being far superior to the AGCT as predictors 
of academic success in Officer Candidate School (198). 

Leadership Studies. Early approaches were based on analysis of War 
Department and civilian literature on leadership, management, etc. Two 
rating scales were developed but not validated. Interview procedures and 
forms were also developed and analyzed (177). Projective technics were 
investigated by the administration of the Rorschach and Thematic Apper- 
ception Tests, sentence absurdities, picture absurdities, and Philo-Phobe list 
to fifty-two men. Most correlations with leadership ratings were low. Prac- 
tical difficulties precluded the use of these instruments on a large scale 
(177). Preference Inventory, PL-1, X-1, contained 100 groups of three activ- 
ities, each presumably preferred by the combat leader, administrative 
leader, or the nonleader. Correlations with leadership ratings at OCS were 
insignificant (204). Leadership Test, L-1, X-1, requiring judgment in lead- 
ership situations, was also discarded (178). Combat reports from the North 
African campaign and analysis of leadership selection by British and 
German armies and of civilian research in the United States led to reexam- 
ination of leadership selection methods. Suggestions were made by the sub- 
committee on leadership of the American Psychological Association Emer- 
gency Committee on Psychology. Ernest Ligon, Consultant to the Secretary 
of War, reported on lack of uniformity in current officer selection proce- 
dures. A Combat Adaptability Rating Scale was used in conjunction with a 
series of tests including an interview, performance situation, and stress 
situations. Reliability of ratings was high, but low correlations were ob- 
tained between the rating scale and tests (225). No follow-up studies were 
made of individuals in actual combat because of administrative difficulties. 

Officer Retention Program. The largest, most successful, and most revolu- 
tionary program in leader selection was worked out for the program of 
selection from among temporary officers of those to be given permanent 
commissions and integrated into the postwar regular Army (4). Personnel 
instruments developed include an Officer’s Application for Commission; 
an Officer Classification Test, OCT-14, a test of general learning ability 
of suitable reliability but not adopted for other reasons; a General Survey 
Test of general educational achievement, including material from the 
fields of English usage, humanities, physical and biological sciences, and 
social sciences; a Biographical Information Blank; an Officer Evaluation 
Report, an improved efficiency rating device; and a Standard Interview 
Procedure, a new type board interview which was objective, reliable, 
uniform, and completely different from usual Army board proceedings. 
The General Survey Test is used as an initial hurdle, while scores on the 
Biographical Information Blank, Officer Evaluation Report, and Interview 
are combined to yield a composite score indicating over-all fitness. All in- 
struments have been shown to be valid for representative officer samples 
against rigid criteria of agreement by fellow officers as to each applicant’s 
over-all fitness (13). A general bibliography on leadership was compiled 
as background for the program (9). 

Construction of Selection Instruments. Preliminary forms were tried out 
on approximately 8000 officers and officer candidates. Two 125-item forms 
of the Officer Classification Test (OCT), containing sections on reading 
comprehension, arithmetic reasoning, and interpretation and judgment, 
were each administered to groups of 500 officers. Two final 110-item forms 
were constructed on the basis of item analysis (306). A study of equivalence 
showed appreciable difference between forms (305). Form A had good 
validity for predicting success in technical courses (307), but did not 
correlate with the general criterion of officer competence used in the major study 
(see Validity of Battery, below). Form 1 of the General Survey Test (GST) 
contained 200 items selected from two preliminary forms after item analysis 
of 1000 cases (306). Percents of applicants selected by various cut-off 
scores by Arm and Service and educational level were determined (311). 
The Biographical Information Blank (BIB) (297) provided a means for 
objectively measuring elements of past experience and personal char- 
acteristics, experimentally determined to be significant for predicting 
officer success. Form E, the final form used, contained 204 items divided 
into four parts: eighty-two biographical items, twelve pairs of self-evalua- 
tion items, ninety-four officer description items in forty-seven groups, and 
sixteen multiphasic pairs of items. All technics of personality measurement 
which had shown promise were investigated and the “forced-choice” technic 
exploited. Nine types of items received preliminary trial. Self-description 
items were presented in quintets (295) containing two desirable, two 
undesirable, and one neutral alternative. The two desirable or undesirable 
characteristics were equal with respect to degree of desirability, but differed 
as to relative importance for officer success. Scale values had been obtained 
previously (296). One hundred ten pairs of items from the Minnesota 
Multiphasic Inventory, Form TC-8a, were tried (294). The criterion for 
item selection on the BIB consisted of “buddy ratings” by fellow officer 
candidates and a ranking by platoon officers. Reliability of the “buddy 
ratings” was high. Correlations of ratings with the AGCT and educa- 
tion were low, consistent with other studies. Alternatives were then analyzed 
as to correlations with high or low criterion groups of officers. Various 
methods of scoring and the effects of various cutting scores were analyzed. 
Development of an objective Officer Evaluation Report (OER) was 
begun with appraisal of current Army efficiency rating methods. The 
War Department AGO Form 67, Officer Efficiency Report, and the 
AAF Form No. 123, Officer Evaluation Report were subjected to in- 
tensive analysis including intercorrelations of sections and trait ratings 
and factor analysis (301, 302). The technic of collecting statements 
from officers and enlisted men concerning characteristics of good and 
poor officers and refining and scaling these statements was investi- 
gated (300). Also available were the findings on investigations of the 
“forced-choice” technic (295, 296). Discriminating power of every item 
in nine different types of rating scales was determined. The Interview 
is a standardized, objective procedure which breaks sharply with tradition. 
It is intended specifically to evaluate ability to deal with people. Board 
members observe behavior and record observations, then check descrip- 
tions, integrate these into ratings on specific areas of behavior, and finally 
evaluate candidate’s ability to deal with people. Objectivity was achieved 
by defining overt behavior that could be observed and judged during the 
interview and developing conversational situations designed to elicit this 
type of behavior (299). 

Validity and Reliability of the Entire Battery. Two purposes were in- 
tended: (a) to select officers who were outstanding in past and present 
performance of duty, and (b) to assure the ability of such officers to remain 
outstanding in the future. For achievement of the first aim, scores on the 
Officer Evaluation Report (OER) and the Biographical Information Blank 
(BIB) for 3000 officers and on the Interview (INT) for 1359 officers were 
validated against job performance as evaluated by a large number of 
fellow officers. Approximately 13,000 officers were studied for development 
of criterion groups. Three groups, high, middle, and low, of about 1000 
men each were used, consisting of men clearly and consistently placed in 
these categories by fellow officers in battalions or similar groups and by 
commanding officers. To achieve the second purpose, scores of 3000 officers 
on the General Survey Test (GST) were compared with educational level 
attained and scores for 367 officers on the Officer Classification Test (OCT) 
were compared with scores on the AGCT. A combined point-index based 
on the OER, the BIB, and the INT adequately differentiated officers on 
the basis of efficiency and did so in a manner far superior to the traditional 
Army board proceedings. Percent of most competent and least competent 
officers chosen by various cutting scores was determined. Mean score on 
the GST showed a high relationship to educational level achieved and 
showed high variability at each educational level. All instruments and the 
criterion were determined to have suitable reliability (298). 
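
The tabulation of percents chosen by various cutting scores can be illustrated with a short sketch. The composite scores below are invented, and the function names are hypothetical; the source reports neither the scale of the point-index nor the cutting scores actually examined.

```python
from typing import Dict, List, Sequence


def percent_at_or_above(scores: Sequence[float], cutting_score: float) -> float:
    """Percent of a group whose composite score meets the cutting score."""
    if not scores:
        raise ValueError("group is empty")
    chosen = sum(1 for s in scores if s >= cutting_score)
    return 100.0 * chosen / len(scores)


def selection_table(
    high_group: Sequence[float],
    low_group: Sequence[float],
    cutting_scores: Sequence[float],
) -> List[Dict[str, float]]:
    """One row per cutting score: percent of each criterion group chosen."""
    return [
        {
            "cutting_score": cut,
            "pct_high_chosen": percent_at_or_above(high_group, cut),
            "pct_low_chosen": percent_at_or_above(low_group, cut),
        }
        for cut in cutting_scores
    ]


# Invented composite scores for two small criterion groups.
high = [70, 80, 90, 60]
low = [40, 50, 65, 30]
table = selection_table(high, low, [60, 75])
```

Raising the cutting score from 60 to 75 in this toy example excludes every member of the low group, but at the cost of rejecting half of the high group, which is the trade-off such a table is meant to expose.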

New Officer Candidate Program. Instruments devised and validated in 
the integration program for officers were adapted for selecting candidates 
for Officer Candidate Schools among enlisted applicants of the Signal 
Corps (303, 304) on the basis of leadership. An interview procedure, a 
biographical information blank, a military report, and a recommendation 
blank were validated against pooled buddy ratings and platoon officer 
ratings at various stages of training. The AGCT and OCT were found to 
be satisfactory predictors of academic success. This work was expanded 
to include the development of officer candidate selection instruments on 
an Army-wide basis. Forms used in the Signal Corps study were revised 
(312) after analysis. 

Integration of Nurses into the Regular Army. Items for a biographical 
information blank (315) and an evaluation report (316) were secured 
from an analysis (285) of essays on good and poor nurses and from officer 
characteristics evaluated previously (296). 

Officer Efficiency Rating Methods. A thoro research program on officer 
efficiency reporting methods grew out of the investigation of the usefulness 
of the semiannual officer Efficiency Report, WD AGO Form 67, as a selec- 
tion device for the retention of wartime officers in the regular Army (298). 
Five methods were evaluated: the currently used WD AGO Form 67 (301) 
and AAF Form 123 (302); a forced ranking form FR-2 (272); a report 
(OER-A) using the rating checklist technic (273); and a report (FCL-2) 
using the forced-choice technic (274, 295, 296, 297, 308). These were 
validated against four separate criteria: (a) Position in criterion groups 
of high, middle, and low officers as rated by groups of fellow officers (295, 
297); (b) a criterion index score of from 0 to 60 based on these nomina- 
tions (266, 298); (c) an over-all rating on a 20-point scale in comparison 
with Army officers in general; and (d) comparison with officers of the 
same grade. Results showed a clear superiority of the FCL over the other 
four instruments (265, 266, 267, 268). In consequence, a revised form, 
FCL-3, was tried with corroborative results (269). Later studies found 
that validity was increased by combining the rating checklist of the 
OER-A with the FCL-3 (309), tho the FCL alone is superior to the RC 
alone; that forced ranking as used in FR-2 increased validity slightly when 
incorporated with other forms, but had low validity alone; and that indorse- 
ment of ratings improved validity slightly (271), but later training had 
little effect (270). 


VI. Tabulating and Analysis Technics 
Test Validity 


A formula was developed for estimating the reduction in size of corre- 
lation coefficient when mean scores are inserted for “no data” cases (234), 
as well as formulas (207) for estimating change in r and other statistical 
constants due to selection on a single variable, either predictor or criterion. 
An empirical study of effects on obtained correlation of restriction in 
range led to results fairly comparable with predictions on basis of Kelley's 
formula (95). A method for estimating the probability of obtaining a 
score at or above the mean on the criterion for any given score on the 
predictor variable (28) was further developed (144, 145) by a method 
for estimating the probability that an individual with any given score on 
the predictor will fall at or above any given critical score on the criterion. 
The original method was extended to make it applicable to evaluation of 
significance of differences in test means for two samples (111). Two 
methods were presented for estimating test efficiency (166). Another 
approach is given in Richardson’s formula (206) for interpreting a test 
validity coefficient in terms of increased efficiency of a selected group of 
personnel. A method was proposed for estimating the size of the sample 
required for test standardization (62). 
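
Two of the estimates discussed above can be sketched in Python. The first function uses the familiar correction for direct selection on the predictor; whether this is identical to the formulas in the reports cited is an assumption. The second assumes a bivariate normal relation between predictor and criterion, with both expressed as standard scores. All function names are hypothetical.

```python
from math import erf, sqrt


def normal_cdf(z: float) -> float:
    """Cumulative probability of the unit normal curve."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def correct_for_range_restriction(
    r_restricted: float, sd_unrestricted: float, sd_restricted: float
) -> float:
    """Estimate r in the unrestricted group from r in the selected group.

    Uses R = r*k / sqrt(1 - r^2 + r^2 * k^2), where k is the ratio of the
    unrestricted to the restricted standard deviation of the predictor.
    """
    k = sd_unrestricted / sd_restricted
    r = r_restricted
    return r * k / sqrt(1.0 - r * r + (r * k) ** 2)


def prob_at_or_above(z_predictor: float, z_critical: float, r: float) -> float:
    """Probability of reaching a critical criterion score, given a predictor score.

    Under bivariate normality the criterion, for a fixed predictor score,
    is normally distributed with mean r * z_predictor and standard
    deviation sqrt(1 - r^2).
    """
    expected = r * z_predictor
    spread = sqrt(1.0 - r * r)
    return 1.0 - normal_cdf((z_critical - expected) / spread)
```

For example, with r = .50 a man one standard deviation above the predictor mean has roughly a .72 chance of reaching the criterion mean, against .50 for a man at the predictor mean.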

IBM Equipment. Maximum utilization of IBM equipment and elimina- 
tion of errors introduced by inaccurate usage were of some concern. An 
extra circuit added to the test scoring machine will give certain derived 
scores directly by shifting the zero point (43). Errors result from the use 
of Government Printing Office Answer Sheets when the test scoring machine 
is set for IBM Answer Sheets (101). Favorable conclusions were reached 
concerning possibility of utilizing No. 1 pencils instead of IBM pencils 
in marking answer sheets (135). Detailed steps necessary in checking the 
adjustment of the IBM Test Scoring Machine were reported (245). A 
tabulation was also made (45) of differences in scores between machine- 
scored and hand-scored answer sheets. 

Test Selection. A procedure was developed for estimating the proportion 
of the variance of the total scores on a test contributed by each of the 
parts (130). The effect of a suppressor variable upon Wherry test selection 
results was also discussed (243). 
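
One common way to apportion total-score variance among the parts of a test is sketched below. The procedure in the report cited is not described here, so this decomposition is an assumption: it credits each part with its covariance with the total, and these covariances sum to the total variance. The function names are hypothetical.

```python
from typing import Dict, List, Sequence


def covariance(x: Sequence[float], y: Sequence[float]) -> float:
    """Population covariance (divide by N)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n


def part_contributions(parts: Dict[str, List[float]]) -> Dict[str, float]:
    """Proportion of total-score variance credited to each part.

    The total score is the sum of the part scores, so
    var(total) = sum over parts of cov(part, total).
    """
    n = len(next(iter(parts.values())))
    total = [sum(scores[i] for scores in parts.values()) for i in range(n)]
    var_total = covariance(total, total)
    return {
        name: covariance(scores, total) / var_total
        for name, scores in parts.items()
    }
```

A part contributes heavily either because its own variance is large or because it correlates highly with the other parts, so the proportions need not follow the number of items in each part.
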

Item Analysis. Tetrachorics were found more reliable than item-test 
correlations obtained with use of Richardson’s Nomograph (80). Errors 
in the use of Richardson’s Nomograph to analyze items not attempted by 
all subjects were pointed out (44). The effect of guessing on biserial 
correlations obtained between items and true scores was found to vary 
with item difficulty (244). Maximum r obtainable with Adkins-Toops 
Quintile Formula was found to vary with difficulty of item (238). 
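
A bare-bones item analysis against a median split on total score might look like the following. It is a simple stand-in, not Richardson’s Nomograph or the Adkins-Toops formula: it reports each item’s difficulty and the difference in proportion passing between the upper and lower halves. The function name is hypothetical, and the total score here includes the item being analyzed.

```python
from statistics import median
from typing import Dict, List


def item_analysis(responses: List[List[int]]) -> List[Dict[str, float]]:
    """responses[i][j] is 1 if examinee i passed item j, else 0."""
    totals = [sum(row) for row in responses]
    cut = median(totals)
    # Examinees exactly at the median are placed in the lower group.
    upper = [row for row, t in zip(responses, totals) if t > cut]
    lower = [row for row, t in zip(responses, totals) if t <= cut]
    if not upper or not lower:
        raise ValueError("median split left one group empty")

    results = []
    for j in range(len(responses[0])):
        p_all = sum(row[j] for row in responses) / len(responses)
        p_upper = sum(row[j] for row in upper) / len(upper)
        p_lower = sum(row[j] for row in lower) / len(lower)
        results.append(
            {
                "difficulty": p_all,
                "p_upper": p_upper,
                "p_lower": p_lower,
                "discrimination": p_upper - p_lower,
            }
        )
    return results
```

Because the largest possible difference between the two proportions shrinks as an item becomes very easy or very hard, an index of this kind depends on item difficulty, in line with the findings summarized above.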

Test Reliability. Kuder-Richardson Formula No. 21 appears to under- 
estimate reliability of scores based on the average of two administrations. 
A new formula was suggested (133). Insofar as assumptions underlying 
Kuder-Richardson Formula No. 21 are met, the addition of zero scores 
will increase the magnitude of the r obtained for all r’s less than unity 
(81). Kuder-Richardson Formula No. 20 appears to overestimate the 
reliability of a test when the distribution of item difficulty is highly 
skewed (200). A technic was given (200) for computing practice effect, 
difference in difficulty of parallel forms, and difference in level of ability 
for two groups taking two forms of a test. 
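
For reference, the two Kuder-Richardson formulas named above can be computed as follows. The sketch uses the population variance of total scores (division by N), which is an assumption about the convention followed; the function names are hypothetical.

```python
from typing import List


def _total_mean_and_variance(responses: List[List[int]]):
    """Mean and population variance of examinees' total scores."""
    totals = [sum(row) for row in responses]
    n = len(totals)
    mean = sum(totals) / n
    variance = sum((t - mean) ** 2 for t in totals) / n
    return mean, variance


def kr20(responses: List[List[int]]) -> float:
    """Formula No. 20: uses the proportion passing each item separately."""
    n = len(responses)
    k = len(responses[0])
    _, variance = _total_mean_and_variance(responses)
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n
        sum_pq += p * (1.0 - p)
    return (k / (k - 1)) * (1.0 - sum_pq / variance)


def kr21(responses: List[List[int]]) -> float:
    """Formula No. 21: needs only the number of items, mean, and variance.

    It treats all items as equally difficult, so it cannot exceed
    Formula No. 20 computed on the same data.
    """
    k = len(responses[0])
    mean, variance = _total_mean_and_variance(responses)
    return (k / (k - 1)) * (1.0 - mean * (k - mean) / (k * variance))
```

On the four-examinee, three-item pattern `[[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]`, Formula No. 20 gives .75 and Formula No. 21 gives .60; the gap reflects the spread of item difficulties.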

Computing and Facilitating Tables, Nomographs and Work Sheets or 
Job Descriptions. The following devices for facilitating computations were 
suggested: (a) job description of Wherry-Doolittle test selection method 
(159); (b) job description and work sheet for computing Pearson r by 
“difference” (diagonal) method (187); (c) work sheet for applying 
Adkins-Toops simplified formulas for item-selection (238); (d) job 
description and work sheet for factor analysis involving thirty-five or fewer 
variables (124); (e) work sheet for correcting correlations for restriction 
on one variable (253); (f) item analysis against median split on total 
test score (Richardson’s Nomograph) (80); (g) expectancy figures based 
on validity coefficients for various Army tests (232); (h) table for chang- 
ing ranks in groups smaller than 100 to equivalent rank in a group of 
100 (193); (i) values of $\sum X$, $\sum X^2$, $\sum XY$, $\sum Y$, and $\sum Y^2$ for 
values of N from one to twenty for each cell of a 13 x 13 scatterplot (188); 
(j) four-place table of FA for three-place values of p or q (141); (k) facilitating 
tables for obtaining standard AGCT scores from number of attempted items 
and number of right answers (33); (l) probable error of median for 
certain values of Q and N (28); (m) values of $1-r^2$, $\sqrt{1-r^2}$, and a 
ratio with denominator $\sqrt{1-r^2}$ for various values of r (2); and (n) values 
of two ratios with denominator $\sqrt{1-r^2}$ for values of r (11). 
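
The “difference” method named in item (b) presumably rests on the identity that the variance of the differences X - Y equals the variance of X plus the variance of Y less twice the covariance. That reading is an assumption, as the work sheet itself is not reproduced here; the sketch below, with hypothetical function names, computes r on that basis.

```python
from typing import Sequence


def variance(values: Sequence[float]) -> float:
    """Population variance (divide by N)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def pearson_r_by_differences(x: Sequence[float], y: Sequence[float]) -> float:
    """r = (var_x + var_y - var_d) / (2 * sd_x * sd_y), where d = x - y.

    No cross-products are needed: only the three variances.
    """
    d = [a - b for a, b in zip(x, y)]
    var_x, var_y, var_d = variance(x), variance(y), variance(d)
    return (var_x + var_y - var_d) / (2.0 * (var_x * var_y) ** 0.5)
```

For hand or machine computation the appeal is that squares of X, Y, and the differences can be summed separately, and the sum of cross-products never has to be formed.
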
Tests and Engineer Trainee Records. August 1941. 

56. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 118. Report on Minimum Literacy Tests Given at 
Fort Belvoir. April 1941. 

57. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 119. Correlation of AGCT-1a Scores and Bennett 
Mechanical Comprehension Test Scores with Airplane Mechanics Course Grades. 
May 1941. 

58. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 121. A Study of Mechanical and Intelligence 
Tests as Possible Predictors of Success in Motor Mechanics and Communi- 
cation Courses at Fort Sill. May 1941. 

59. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 125. Procedure Used in Scaling the Non-Lan- 
guage Test 2abc. April 1941. 

60. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 143. The Selection of Radio Operators and Me- 
chanical Students. March 1942. 

61. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 149. Standardization of Classification Test R-1. 
June 1941. 

62. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 151. Empirical Check on Sampling Effects and 
Size of Required Sample. October 1941. 

63. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 159. Tabulation of Clerical Aptitude-1 Army 
Grade Distributions from Field Returns to August 1, 1941. October 1941. 

64. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 163. Reliability of the Road Test. October 1941. 

65. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 166. Reliability of the Code Learning Test and 
Relation to the Radiotelegraph Operator Aptitude Test. November 1941. 

66. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 167. Comparison of Mechanical Aptitude-1 Scores 
and Success in Signal Corps Post School, Fort Monmouth. November 1941. 

67. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 168. Item Analysis of Educational Achievement 
Tests, EA-1, X-1, Maxwell Field. September 1941. 

68. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 170. Validity of Classification Tests (the AGCT 
Non-Language Test, NL-2abc, Mechanical Aptitude, MA-1, Clerical Aptitude, 
CA-1) for Engineer’s Training Course. October 1941. 

69. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 171. Study of the CA-1, MA-1, and AGCT Score 
Distributions of Selectees at Camp Croft, South Carolina. August 1941. 

70. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 175. Comparison of Scores of “Mechanics” and 
“Non-Mechanics” at Camp Lee, Virginia, on Forms A and B of the Surface 
Development Test (Experimental MA-2, -3, S.D.), Mechanical Comprehension 
Test (Experimental MA-2, MC), and the AGCT. November 1941. 

71. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 176. Reliability of Psycho-Physical Tests Used 
at Camp Lee, Virginia. October 1941. 

72. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 178. Summary of Fort Knox Driver Study. 
November 1941. 

73. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 181. Studies on Prediction of Achievement by 
Prospective Bombardiers. June-December 1941. 

74. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 182. Scaling of Higher Examinations, H-1 and 
H-2. November 1941. 

75. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 193. Reliability of AGCT-1a by Test-Retest 
Method. November 1941. (Supplement to the above, January 1942.) 

76. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 195. Validation of Night Vision Test. December 
1941. 

77. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 196. The Relation of MA-1, AGCT-1a, and Edu- 
cation to Auto Mechanics Final Grades at Fort Knox, Kentucky. December 
1941. 

78. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 206. Night Vision and Its Relation to Race and 
Blood Sugar. 


79. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 208. Prediction of Final Course Grade from 
Higher Examinations, H-1, and Army Officer Training Examinations. Decem- 
ber 1941. 

80. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 211. Computation of Tetrachoric Correlations 
by Chesire-Saffir-Thurstone and Richardson Charts. 

81. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 215. The Effect on the Reliability Coefficient 
of Adding Zero Scores to the Distribution of Scores. December 1941. 

82. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 217. Differences in Test and Retest Scores on 
Experimental MA-2 (Mechanical Comprehension, Mechanical Information, 
and Surface Development) after (9) Weeks Training at Enlisted Men’s School, 
Fort Belvoir, Virginia. December 1941. 

83. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 219. Relationship of Years of Education and 
Signal Corps Code Aptitude Test Scores to Final Course Grades. October 1941. 

84. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 220. Prediction of Code Speed from AGCT, 
MA-1, CA-1, and ROA Tests. December 1941. 

85. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 222. Internal Evidences of Relative Difficulty 
of Higher Examinations, H-1 and H-2. July 1941. 

86. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 223. Reliability of Camp Lee Road Test. Decem- 
ber 1941. 

87. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 225. Summary of Academic and Aptitude Test 
Results for Bombardiers and Navigators at Maxwell and Ellington Fields. 
December 1941. 

88. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 228. The Reliability of Higher Examinations, H-1 
and H-2. January 1942. 

89. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 229. Relation of MA-1, AGCT, and Education 
with Final Grades at the Tank Mechanics Course, Fort Knox, Kentucky. Jan- 
uary 1942. 

90. PersonneL ResearcnH Section (Starr), ApsJuTANT GENERAL’s OrFice, War 
DEPARTMENT. PRS Report No. 234. Report on the Standardization of MA-2 
and MA-3. January 1942. 1 

91. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 235. Selection of Radiotelegraph Operators.
January 1942.

92. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 236. Reports: On the Value of the Code Aptitude
Test and the Army General Classification Test for Predicting Success at
Radio School; Relationship of Code Aptitude Test Scores to Musical Ability;
and Army General Classification Test Scores. December 1941.

93. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 237. Final Report on the Scoring and Reporting
of Results on the Air Corps Achievement Examinations Given November 12,
1941. January 1942.

94. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 242. Prediction of Final Grades of Graduates
of the Clerical Course, Fort Knox, Kentucky, from AGCT, CA-1, and MA-1
Scores. January 1942.

95. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 244. The Effect of Restricted Ranges of Ability
on Correlations Between AGCT and the Three Forms of the Mechanical
Aptitude Test. January 1942.

96. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 245. Tables of Scores of Commissioned Officers
on the AGCT, Clerical Aptitude Test, and Mechanical Aptitude Test. January
1942.

97. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 246. Grades of Motor Mechanics as Related to
Part and Total Scores on MA-1 and AGCT, Camp Lee, Virginia. January 1942.

98. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 252. Analysis of Soldier Performance Report
Data. February 1942.

99. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 253. Selection of Officer Candidates: Relation
of AGCT, Education, and Other Variables to Success in Officer Candidate
School. February 1942.

100. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 255. Summary of Construction of Electricity and
Radio Information TK-1, X-2. February 1942.

101. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 258. A Comparison of the Amount of Tolerance
for Misplaced Answers Found in the GPO and IBM Machine-Scored Answer
Sheets. February 1942.

102. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 266. Standardization of the Radiotelegraph
Operator Aptitude Test, ROA-1, X-1. May-November 1942.

103. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 267. Tables of Equivalents for Otis, Army Alpha,
and AGCT Scores.

104. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 268. Notes on the Preparation of Conversion
Tables from Army Alpha Raw Scores to Corresponding General Classification
Test-1a Standard Scores. December 1940.

105. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 269. Study of Tests for the Determination of
Code Aptitude.

106. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 271. Selection of Items from the Army General
Classification Test, AGCT-1b, for Classification Test, R-2. February 1942.

107. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 275. The Prediction of Machine Shop
Performance, Air Corps Technical School, Chanute Field, Illinois. March 1942.

108. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 277. Prediction of Grades in Gunnery School
from MA-1 and AGCT. March 1942.

109. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 278. Analysis of CA-2 Data Obtained at the
Clerical Section of the Armored Force School, Fort Knox, Kentucky. March
1942.

110. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 283. Item Analysis: Automotive Information
Test, TK-1, GAI-1, X-1. May 1942.

111. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 284. The Evaluation of Differences Between the
Test Means for Two Sample Populations. March 1942.

112. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 286. Analysis of Wechsler Self-Administering
Test Data. March 1942.

113. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 287. Prediction of Auto Mechanics Final Grades
from AGCT-1a and MA-1 Scores at Fort Knox, Kentucky. March 1942.

114. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 291. Estimation of the Effect of Omitting Block
Counting from the Army General Classification Test. March 1942.

115. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 292. Analysis of Block Counting Items of the
AGCT. March 1942.

116. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 301. Item Analysis of Arithmetic Test, EA-3,
X-1. Selection of Wrong Alternatives. April 1942.

117. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 307. Interpretations of AGCT Test Scores of
Negro and White Selectees. April 1942.

118. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 308. Summary of Status of OCT-1, X-1,
Standardization. July 1942.

119. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 309. Standardization of Classification Test R-2.
April 1942.

120. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 311. Driver Experience Inventory. August 1942.

121. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 312. Night Vision of Colored and White Soldiers.
April 1942.

122. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 313. Analysis of Test Scores of Apprentice
Mechanics Motor Training Section, Quartermaster Replacement Training
Center, Camp Lee, Virginia. October 1942.

123. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 314. Reaction Time and Accuracy Tests Used at
Camp Holabird. April 1942.

124. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 319. Procedure for Factor Analysis of Studies
Involving Thirty-Five or Fewer Variables. May 1942.

125. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 324. Grades in a Motor Mechanics Course as
Related to Vocational Training, Civilian Occupation, and Test Scores on MA-2,
MA-3, Enlisted Men and Officers, Fort Benning. May 1942.

126. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 325. Report on Analysis of Fort Knox Repeat
Driver Tests, March 1942; Improvement on Road Test vs. Fort Knox Driver
Tests. May 1942.

127. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 326. Report on Standardization of WCT-1, X-2.
May 1942.

128. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 328. Study of Some Factors in Radio Operator
Selection, Scott Field, Illinois. May 1942.

129. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 330. Grades in Maneuvers Course and Winch
Mechanics Course at the Balloon Barrage Course, Camp Tyson, Tennessee,
as Related to Each Other to Score on AGCT, Mechanical Aptitude, MA-1, and
Clerical Aptitude, CA-1. June 1942.

130. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 331. A Procedure for Estimating the Proportion
of the Total Scores on a Test Contributed by Each of the Parts of the Test.
June 1942.

131. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 334. Reliability of Fort Belvoir Night Vision
Tests, June 1942; Hopkins Night Vision Test (Day to Day Reliability). July
1942.

132. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 338. Success in Officer Candidate Courses
Related to AGCT Scores and Other Variables. July 1942.

133. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 339. Computation of Test Score Reliabilities.
May 1942.

134. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 340. Accident Record vs. Psychological Tests.
July 1942.

135. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 344. The Effect of the Use of No. 1 Pencils on
the Accuracy of Scoring IBM Answer Sheets by Machine. July 1942.

136. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 347. Reliability of Personnel Form P-1. August
1942.

137. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 350. Analysis of Visual Classification Test, VC-1,
X-1 Data from Camp Croft. July 1942.

138. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 354. Standardization of the Visual Classification
Test, VC-1, X-2, August 1942; Supplement: A Standard Score Scale for the
Visual Classification Test, VC-1, X-2. September 1942.

139. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 356. Relation of Failure to Army General
Classification Test, Fort Monmouth, New Jersey.

140. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 358. Validation of Tests for Selection of Radio
Operators, ROA-1, X-1; CLT-2, X-3; and Substitution Test. August 1942.

141. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 360. Four-Place Table of pq/z for Three-Place
Values of p or q.

142. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 363. Tables for Use in Converting Scores on AG
Tests to Those on Coast Artillery Entrance Examinations. August 1942.

143. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 371. Analysis of Attempts on Each Type of AGCT
Item by Grade V Men in Regular and Special Training. September 1942.

144. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 375. The Computation of Expectancy Tables.
June 1942.

145. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 375a. Interpretation of Correlation Coefficients
in Terms of Expected Performance in One of the Associated Variables. August
1945.

146. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 378. Reliability of Radiotelegraph Operators
Aptitude Test, ROA-1, X-1. September 1942.

147. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 379. Driver Experience Inventory #2 (Camp
Pickett Validation Data). October 1942.

148. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 380. Validity of Officer Candidate Test, OCT-1,
X-1. October 1942.

149. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 381. Personnel Form P-1 (Also called Shipley
Personality Inventory and the Personnel Form R-2). April 1942.

150. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 382. Analysis of Mental Alertness-1, X-2 Test
Results for Female Students at Mount Vernon Seminary, Woodrow Wilson
High School, Trinity College, and Catholic University. October 1942.

151. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 385. Standardization of the General Proficiency
Test, WCT-1, X-3. October 1942.

152. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 386. Test Scores of Accident vs. Non-accident
Drivers. October 1942.

153. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 389. Evaluation of War Orientation Test-1, X-1.
June 1942.

154. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 392. Standardization of WAAC Specialist Tests.
November 1942.

155. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 393. Analysis of Responses to Each Alternative
of Each Item in the General Proficiency Test, WCT-1, X-3. November 1942.

156. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 394. Driving Performance vs. Experience and
Test Scores (Fort Knox Data). November 1942.

157. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 401. Analysis of Responses to Each Alternative
Made by Men Tested at Induction Stations: Visual Classification Test, VC-1,
X-2. November 1942.

158. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 402. Analysis of Responses to Each Alternative
of Each Item for Grade V Men in Special Training Units and in Regular
Training Units. November 1942.

159. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 403. The Wherry-Doolittle Test Selection Method.

160. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 406. Analysis of Responses to Each Alternative
of Each Item. December 1942.

161. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 410. Development of Improved Radio Code
Aptitude Tests. March 1942.

162. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 411. Selection of Men for Training as Medical
Technicians. Evaluations of Procedures Used at Camp Lee and/or Camp
Pickett, Virginia; Billings General Hospital, Fort Harrison, Indiana; Walter
Reed Hospital, Washington, D. C. January 1943.


163. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 412. Relation Between Original Tests (MA, CA,
and AGCT) Given at Reception Centers and Retests Given at the Armored
Force School, Fort Knox. December 1942.

164. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 418. Selection of Officer Candidates: Validity
of Officer Candidate Test, OCT-1, X-1, for Predicting Academic Success of the
West Point 1942 Class. January 1943.

165. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 420. ACE Psychological Examination (1942 ed.)
Raw Scores Equivalent to AGCT Standard Scores. August 1943.

166. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 421. Methods for Estimating Test Efficiency.
August 1943.

167. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 422. Officer Candidate Test, OCT-1, X-1, Item
Analysis Based on Samples of Fort Belvoir and Camp Lee Officer Candidates.
March 1943.

168. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 424. Officer Candidate Tests OCT-2, X-1, and
X-2, Item Analysis Based on Camp Lee Officer Candidates and Compilation
of OCT-1 and OCT-2. March 1943.

169. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 425. Item Analysis of Automotive Information
Test, TK-1, X-2, Fort Meade, Maryland. June 1942.

170. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 427. Arithmetic Test, EA-3, X-2. Item Analysis
Based on Sample of WAAC Auxiliaries. March 1943.

171. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 428. Validity of Military Police Test Battery
for Predicting Course Grades at Provost Marshal General’s School, Fort
Oglethorpe, Georgia, March 1943; Standardization of Reading and Reporting-1,
X-1, for Military Police Officer Candidates. May 1943.

172. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 430. Selection of Truck Drivers: Driver
Experience Inventory #2, Driver Information Test #9, and other Measures, Camp
Lee, Virginia. March 1943.

173. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 432. General Classification Test, GCT-1c or 1d.
Test-Retest Differences for Enlisted Men Who Score in Grade V on Original
Test. April 1943.

174. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 433. A Comparison of the AGO Experimental
Battery, WPQ-1, X-1 and the West Point Qualifying Examinations for
Prediction of First Term Academic Performance of Fourth Classmen Entering July
1942. April 1943.

175. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 437. Selection of Officer Candidates,
Standardization and Validation of OCT-1 and OCT-2 at Fort Benning and Fort Monmouth.
August 1943.

176. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 439. Norms for Driver Information Tests DIT-9
and DIT-10. January 1943.

177. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 444. Selection of Leaders, Status of the
Measurement of Leadership. April 1943.

178. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 446. Selection of Officer Candidates, Validation
Study of Leadership Test L-1, X-1, at Fort Belvoir and Fort Benning. June 1943.

179. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 447. Standardization of Clerical Aptitude Test,
CA-2, X-2 for War Department Civilian Personnel. November 1942.

180. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 449. Mechanical Aptitude Test MA-4, X-1: Item
Analysis Based on Sample of WAAC Auxiliary. July 1943.

181. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 450. The Validation of an Experimental Battery
of Combat Intelligence Tests at Camp Blanding, Florida. June 1943.

182. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 451. Comparison of Scoring Formulas “Rights
and Rights Minus 1/3 of the Wrongs” Based on the Results of West Point
ge on AGCT-1d, Elementary Math-1, X-1, and Language Aptitude-1, X-1.
July 1943.

183. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 452. Item Analysis: General Automotive
Information Test, TK-7, X-1, Normoyle Ordnance Motor Depot, San Antonio, Texas.
September 1943.

184. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 453. Comparison of Performance of Women
(WAAC’s) with That of Men: Radiotelegraph Operator Aptitude Test, ROA-1,
X-1. July 1943.

185. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 457. Clerical Aptitude Test CA-2, X-2: Item
Analysis Based on War Department Civilian Employees. August 1943.

186. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 459. Validation of Tests of Selection of WAAC
Trainees for Basic and Specialist Schools. August 1943.

187. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 462. Procedure for Computation of the Pearson
Product Moment Coefficient of Correlation Using Special Computation Chart.
October 1943.

188. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 465. Values of ΣX, ΣX², ΣXY, ΣY, and ΣY²,
when N varies from 1 to 20 for each Cell of a 13 x 13 Scatterplot.

189. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 466. Development of Basic Classification Battery.
The Influence of General Information in the Reading and Vocabulary Tests.
September 1943.

190. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 468. Selection of West Point Cadets. March 1944.

191. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 469. Validation of Women’s Classification Test,
WCT-2, as a Predictor of Success in WAC Officer Candidate Schools, Fort
Oglethorpe. March 1944.

192. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 470. Standardization of Reception Center Special
Training Unit Tests, Fort Ontario. November 1943.

193. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 474. Rating Procedures for Measuring
Performance. Paired Comparison and Rank in 100. November 1943.

194. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 475. Interim Report on AAF Ground Crew
Classification Test. August 1943.

195. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 476. Validation of Radiotelegraph Operator
Aptitude Test, ROA-1, X-1, and the Code Learning Test, CLT-2, X-3, Fort
Knox. November 1943.

196. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 477. Statistical Summary on the Aptitude Test
Studies at the Second Air Force Intelligence School, 18th Replacement Wing,
Salt Lake City, Utah. December 1943.

197. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 483. Relationship of Classification Test R-1 and
WAC Classification Test, WCT-2, for a Recruiting Station Population. January
1944.

198. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 484. Validity of the Officer Candidate Tests for
Predicting Academic Success at the Tank Destroyer and Transportation Corps
Officer Candidate Schools. January 1944.

199. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 485. Relationship of WAC Classification Test,
WCT-2 to Army General Classification Test for WAC Applicants. January 1944.

200. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 486. Technique for the Comparison of Two
Groups on Two Forms of a Test. January 1942.

201. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 488. Validation of AAFTTC and AGO Aptitude
Tests. October 1942.

202. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 493. Construction and Standardization of
Women’s Classification Test, WCT-2, to Replace WCT-1 for WAC Recruiting.
September 1944.

203. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 499. The Use of Age-Grade Placement and Civil
Success Data in Predicting Scores on the Soldier Performance Report. April
1942.

204. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 500. The Validity of Preference Inventory (PL-1,
X-1) for Prediction of Leadership Ratings at the Infantry and Engineer Officer
Candidate Schools. March 1944.

205. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 501. Report on the Development of Machine-
Scores Code Aptitude Test, ROA-2, X-1. July 1943.

206. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 502. Article: Interpretation of a Test Validity
Coefficient in Terms of Increased Efficiency of a Selected Group of Personnel,
by M. W. Richardson. April 1944.

207. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 504. Article: Estimation of the Change in Certain
Statistical Constants Due to Selection on a Single Given Variable. April 1944.

208. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 506. Analysis of a Rating Scale for the
Determination of Marginally Satisfactory and Unsatisfactory Soldiers, Fort
McClellan. May 1943.

209. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 510. Validation of Induction Station Tests I,
Fort Belvoir. March 1943.

210. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 511. Validation of Induction Station Tests II,
A Preliminary Study at Camp Pickett Medical Replacement Training Center.
March 1943.

211. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 512. Validation of Induction Station Tests III,
Fort McClellan. May 1943.

212. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 514. The Selection of Inductees at Induction
Stations—The Comparability of Qualification Test, Q-1, and Qualification Test,
Q-2, First, Fourth, and Fifth Service Commands. October 1943.

213. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 515. A Follow-Up of the Induction Station Test
Validation Study at Fort McClellan IRTC. April 1944.

214. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 516. The Validation of Induction Station Test
V: The Relationship between the Scores on the Experimental Induction Station
Tests and Success in Reception Center Special Training Unit at Fort
Leavenworth, Fort Benning, Camp Robinson, and Fort Sam Houston. April 1944.

215. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 517. Standardization of Group Target Test,
GT-1, Individual Examination, IE-1, Group Orientation, GO-1, Individual Target,
IT-1, Visual Classification, VC-1a, and Non-Language Individual Examination,
NIE-1. April 1944.

216. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 518. Validation of Induction Station Tests, Six
Supplements. May 1944.

217. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 519. Differential Patterns of Item Attempts on
the Army General Classification Test Exhibited by Grade IV and V Men Tested
at the Reception Center, Fort Leavenworth, Kansas, 1944. April 1944.

218. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 521. The Validity of the Wechsler Mental
Ability Scale as a Predictor of Soldier Performance Ratings of STU Trainees.
April 1944.

219. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 522. West Point Selection Examination for
Prediction of First Term Academic Performance of 1943 Fourth Classmen. June
1944.

220. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 527. Standardization of the West Point
Qualifying Examination, WPQ-1, for the 1944 Fourth Class. June 1944.

221. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 528. AIT-Further Validation of the Shoulder
Patch Test Executed at the ERTC, Fort Belvoir, Virginia. June 1944.

222. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 529. AIT Validation Study at the QMRTC, Camp
Lee, Virginia. June 1944.

223. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 530. Report on Radio Code Aptitude Tests.
December 1944.

224. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 530. Report on Radio Code Aptitude Tests: Part
I, Validation; Part II, Standardization. May 1945.

225. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 532. Tank Destroyer School, Camp Hood:
Experiment in Combat Adaptability. December 1943.

226. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 543. Current Status and Recommendations
Relating to Tests for Classification of Aircraft Warning Trainees. February 1944.

227. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 545. Performance and Written Tests and Personal
Data Factors as Predictors of Grades of Enlisted Air Crew Radio Mechanics
at Scott Field. November 1944.

228. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 546. Validation of Practical Performance and
Other Technical Tests at Keesler Field, Airplane Mechanics. October 1944.

229. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 550. The Relative Validities of Performance
Aptitude and Written Tests for the Prediction of Success in Aircraft Armorers
School at Buckley and Lowry Field, Colorado. August 1944.

230. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 551. Standardization of the Army Individual
Test (AIT-1) Camp Barkeley, Texas, May 1944. August 1944.

231. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 553. Development of the Weather Aptitude
Test, TC-3a, for Predicting Academic Success at Weather Observer Schools.
August 1944.

232. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 562. Re-evaluation of Expectancy Tables in
Easily Understood Terms Which Are Comparable from One Test to Another.
September 1944.

233. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 563. Analysis of Data on Mental, Mechanical,
Clerical, Motor, and Visual Tests from Philadelphia Quartermaster Depot.
September 1944.

234. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 564. Estimating the Effect on Correlations of
Inserting Mean Scores for “No Data” Cases. September 1944.

235. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 567. Selection of Alternatives for Arithmetic
Reasoning Test, Experimental Forms 1, 2, 3, and 4, for the AGCT-3. October
1944.

236. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 568. Development of AGCT-3 and Information
Tests. August 1945.

237. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 590. Summary Report on Warrant Officer
Examinations. October 1944.

238. PERSONNEL RESEARCH SECTION (STAFF), ADJUTANT GENERAL’S OFFICE, WAR
DEPARTMENT. PRS Report No. 596. Procedures for Applying the Adkins-Toops
Simplified Formulae for Item Selection. October 1944.

. PersonNEL ResearcH Section (Starr), ApJUuTANT GENERAL’S OFFICE, WAR 
DEPARTMENT. PRS Report No. 597. Validity of AGO Tests as Predictors o/ 
Success in Rock Island Armament Maintenance School, and Rock Island 
Arsenal Sub-Office at Dunwoody Institute. November 1944. 

. PERSONNEL ReEseEARCH Section (Starr), ApJUTANT GENERAL’s Orrice, War 
DepartTMENT. PRS Report No. 599. Validity of Learning Ability, OG-056a and 
Clerical Aptitude CA-3, Part A in Certain Sections of the Casualty Branch, 
AGO. October 1944. 

241. PersonNneL Researcn Section (Starr), ApJuTANT GENERAL’s OFFICE, WAR 
DEPARTMENT. PRS Report No. 603. Standardization and Item Analysis of Nine 
Verbal Tests. December 1944. 

242. Personne Researcn Section (Starr), Apsutant Generat’s Orrice, War 
DepaRTMENT. PRS Report No. 610. Analysis of Procedure and Rejection for 
All Induction Stations Operating During a Six Day Period in June 1944. 
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243. PERSONNEL Researcn Section (STAFF), ADJUTANT GENERAL’s OFFICE, WAR 
DepaRTMENT. PRS Report No. 611. Test Selection and Suppressor Variables. 
January 1945. 

244. PERSONNEL ResearcH Section (StaFF), ADJUTANT GENERAL’s OFFICE, WAR 
DepARTMENT. PRS Report No. 612. The Effect of Guessing on the Biserial Cor- 
relation between Two Category Test Items and “True” Scores. March 1945. 

245. PERSONNEL Researcu Section (Starr), ADJUTANT GENERAL’s OFFICE, WAR 
DEPARTMENT. PRS Report No. 613. Checking the Adjustment of IBM Test 
Scoring Machines. 

246. PersONNEL ResearcH Section (Starr), ADJUTANT GENERAL’s OFFICE, WAR 
DEPARTMENT. PRS Report No. 617. Validation of Testing Battery Suitable for 
Use in the Selection of Under-Engineer Trainee of the Training Section, Signal 
Corps, War Department. March 1945. 

. PERSONNEL Researcu Section (Starr), ADJUTANT GENERAL’s OFFICE, WAR 
DepaRTMENT. PRS Report No. 621. Standardization of the General Clerical 
Abilities Test, CC-105a: Part 1-New York Port of Embarkation. April 1944. 

. PERSONNEL Researcu Section (Starr), ApjJuTANT GENERAL’s Orrice, WAR 
DepARTMENT. PRS Report No. 622. Check on the Standardization of Army 
Radio Code Aptitude Test 1944 (ARC-1). August 1945. 

. PERSONNEL Researcw Section (Starr), Apyurant GENERAL’s Orrick, WAR 
DepaRTMENT. PRS Report No. 628. Study of the Difficulties of the Warrant 
Officer General Educational Test. November 1942. 

. PeRSONNEL ResearcH Section (Starr), ADJUTANT GENERAL’s OFFice, WAR 
DEPARTMENT. PRS Report No. 633. A Preliminary Determination of Jtem 
Difficulty and Validity for STU Placement and Achievement Tests in Reading 
and Arithmetic. June 1945. 

. PERSONNEL ResearcnH Section (Starr), ADJUTANT GENERAL’s Orrice, WAR 
DEPARTMENT. PRS Report No. 634. Selection of Content for Final Forms of 
Achievement and Placement Tests for Reading and Arithmetic Courses. 

. PERSONNEL ResearcH Section (Starr), ADJUTANT GENERAL’s OFFICE, WAR 
DeparTMENT. PRS Report No. 636. Standardization of Shop Mechanics Tests, 
SM-1 and SM-2, and Automotive Information Tests, AIT-1 and AIT-2. June 1945. 

. PersoNNEL Researcn Section (Starr), ADJUTANT GENERAL’s OFFICE, WAR 
DEPARTMENT. PRS Report No. 638. Standard Operational Procedure for Cor- 
recting r’s between Two Variables for Restriction on Third. 

. PERSONNEL ResearcH Section (Starr), ADJUTANT GENERAL’s OFFice, WAR 
DEPARTMENT. PRS Report No. 640. Construction of Experimental and Final 
Forms of Achievement and Placement Tests for Reading and Arithmetic 
Courses in Reception Center Training Units: Construction of Standard Score 
Scales. June 1945. 

. PersonneEL Researcn Section (Starr), ApJuTANtT GENERAL’s OFFrice, WAR 
DeparTMENT. PRS Report No. 641. Development and Validation of the Army 
Automotive Screening Tests for Use in Ordnance Schools. December 1944. 

. Personne: Researcn Section (Starr), ApyJuTANT GENERAL’s OFFICE, WAR 
DeparTMENT. PRS Report No. 64la. Follow-up Study of the Validity of the 
Army Automotive Screening Tests for Use in Ordnance Schools. January 1946. 

. Personne: ResearcnH Section (Starr), ApJuTANT GENERAL’s Orrice, WAR 
DeparTMENT. PRS Report No. 644. Interpretation of Army Test Data for 
Civilian Educational and Occupational Guidance: Relation of Army General 
Classification Test to American Council on Education Psychological Examina- 
tion for College Freshmen (ACE) 1942 edition. August 1945. 

. PersonneL Researcn Section (Starr), ApyuTaNnt GENERAL’s Orrice, War 
DeparTMENT. PRS Report No. 646. Completion of General Classification Test-la 
(Spanish Version). August 1945. 

. Personne, Research Section (Starr), Apyutant GeNeRAL’s Orrice, WAR 
DeparTMENT. PRS Report No. 647. Validity of General Clerical Abilities Test, 
CC-105a, and of Learning Ability Test, OG-056a, for Clerical Jobs at Head- 
quarters, Sixth Service Command, Chicago, Illinois. September 1945. 

. Personne Researcn Section (Starr), ApyuTaANt GeENERAL’s Orrice, War 
DeparTMENT. PRS Report No. 648. Validation of Mechanical Knowledge Parts 
I and II, Paper Form Board, and Learning Ability Tests. September 1945. 

. Personne: Researcn Section (Starr), ApyuTANT GENERAL’s Orrice, WAR 
DeparTMENT. PRS Report No. 652. Construction of, Clerical Aptitude Test 
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CA-3, Part A—Speed; Part B—Fundamentals; and Part C—English for Use in, 
the Placement of Civilian Personnel of War Department Installations. Sep. 
tember 1945. 

262. PersonNeL Researcn Section (Starr), ApyutaANt GENERAL’s OFFICE, War 
DeparTMENT. PRS Report No. 655. Construction of General Clerical Abilities 
Se CC-105a, for Measuring Aptitudes of Civilian Clerical Workers. November 


263. PersonneL ResearcuH Section (Starr), ApJuTANT GENERAL’s OFFICE, War 
DEPARTMENT. PRS Report No. 663. Report on Use of the General Mechanical 
Aptitudes Test, CM-142a, for the Selection of Trainees for the Rock Island 
Armament Maintenance School Courses. September 1945. 

264. PersonneL Researcu Section (Starr), ApyJuTaAnt GENERAL’s OFFICE, War 
DEPARTMENT. PRS Report No. 667. Construction of General Mechanical Apti- 
tudes Test for Use with Technical and Mechanical Employees (Civilian). 
October 1945. 

265. PersonNEL Researcn Section (Starr), ApJuTANT GENERAL’s OFFICE, War 
DEPARTMENT. PRS Report No. 670. Major Study of Comparative Validity oj 
: ive Periodic Officer Efficiency Reporting Methods: 1. Zone of Interior. Decem.- 

er 1945. 

266. PersonNneL Researcn Section (Starr), ApyutaAnt GENERAL’s Orricr, War 
DEPARTMENT. PRS Report No. 671. Comparative Validity of the VD AGO Form 
ad and the FCL-2 According to Various Breakdowns: 1. Zone of Interior. Decem.- 

er 1945. 

267. PersoNNEL Researcu Section (Starr), ApyuTANt GENERAL’s OFfFice, War 
DeparTMENT. PRS Report No. 672. Major Study of Comparative Validity o/ 
Five Periodic Efficiency Reporting Methods: II. European Theater. December 
1945. 

268. Personne, Researcn Section (Starr), ApyutTant GENERAL’s Orrice, War 
DEPARTMENT. PRS Report No. 673. Comparative Validity of the WD AGO 
Form 67 and the FCL-2 According to Various Breakdowns: II. European 
Theater. December 1945. 

269. Personnec Researcn Section (Starr), ApyuraANnt GENERAL’s OrrFice, War 
DepartTMENT. PRS Report No. 674. A Field Study of the Effectiveness of FCL- 
3a, A Self-Training, Indorsed Efficiency Report. November 1945. 

270. PersonneL Researcn Section (Starr), ApyuTANT GENERAL’s Orrice, War 
DEPARTMENT. PRS Report No. 675. The Relationship Between Main Civilian 
Occupation and Other Variables. Part I—Preliminary Study Based on Machine 
Record Survey #2, November 1945. Part I1l—Relation Between Main Civilian 
Occupation and Army General Classification Test Standard Score, March 1945. 
Part IIl—Effect of Rater Training on WD AGO Form 67. January 1946. 

271. Personnet Researcnu Section (Starr), ApyutaAnrt GENERAL’s OFFice, War 
DeparTMENT. PRS Report No. 676. The Effect of Indorsement on the Validity 
of Efficiency Report WD AGO Form 67. December 1945. 

272. Personne Researcn Section (Starr), ApyutaANt GENERAL’s Office, War 
DEPARTMENT. PRS Report No. 677. Experimental Evidence of the Value of Rank- 
ing as a Method of Rating. December 1945. 

273. PersonneL Researcu Section (Starr), ApyutaANnt GENERAL’s OrFice, War 
DepartMENT. PRS Report No. 678. Construction and Scoring of the Officer 
Efficiency Report OER-A, October 1945. 

274. Personne Researcw Section (Starr), Apyutant GENERAL’s OFFIce, War 
DepartMENT. PRS Report No. 679. Construction and Scoring of the Officer 
Efficiency Reports, FCL-2a, b, c. October 1945. 

275. PersonNeEL Researcn Section (Starr), ApyutaANt GENERAL’s OFFICE, WAR 
DeparTMENT. PRS Report No. 681. Construction, Validation, and Standardiza- 
tion of a Battery of Tests for the Army Finance School, Duke University, North 
Carolina. May and June 1944. 

276. PersonNEL Researcu Section (Starr), ApyutaNt GENERAL’s OFFICE, WAR 
DeparRTMENT. PRS Report No. 682. Development of Tests for Termination 
Accountants and Auditors. May and June 1944. 

277. Personne Researcn Section (Starr), ApyutaANt GENERAL’s OrrFice, War 
DeparTMENT. PRS Report No. 683. Validity of AGCT-3a Total and Part Scores 
in Predicting Success in Army Technical Training Courses. May 1946. 
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278. PersonNEL Researcn Section (Starr), Apyutant GeENeRAL’s Orrice, WAR 
DeparTMENT. PRS Report No. 684. Standardization of AGCT-3b Total and 
Part Scores. May 1946. 

279. PersONNEL Researcn Section (Starr), Apyutant GENERAL’s OFFIce, WAR 
DeparTMENT. PRS Report No. 685. Analysis of Military Knowledge Test, TC- 
101x. March 1946. 

280. Personne: Researcn Section (Starr), ApyutTant GENERAL’s OrFrFice, WAR 
DeparTMENT. PRS Report No. 686. Validation of Forms 3 and 4 of Electrical 
Information Test and Forms 3 and 4 of the Radio Information Test among 
Trainees at the Radio Repair Course, CSCS, Camp Crowder, Missouri. July 1946. 

. Personne: Researcn Section (Starr), ApyutaNt GeENeERAL’s Orrice, War 
DepartTMENT. PRS Report No. 687. Validity of Radio Information Test, Forms 
1 and 2, in Predicting Success among Trainees in the Radio Repair Course 
and in the Communications Course at the Tank Destroyer Training School, 
Camp Hood, Texas. July 1946. 

. PersonNEL Researcn Section (Starr), ApyuTANt GENERAL’s OFFICE, WAR 
DEPARTMENT. PRS Report No. 688. Validation and Item Selection for the Elec- 
tricity and Radio Information Test at Truax Field, Wisconsin. July 1946. 

. PersonneL Researcu Section (Starr), Apyutant GeENERAL’s OFFice, War 
DepARTMENT. PRS Report No. 689. Administration of Electrical and Radio 
Information Test to Reception Center Populations. May 1946. 

. PersonneL Research Section (Starr), Apyutant Generat’s Orrice, War 
Department. PRS Report No. 690. Validation of Four Experimental Forms of 
the Electrical Information Test at the New York Trade School and the New 
York Television Institute. July 1946. 

. Personnet Research Section (Starr), Apyutant GeENeRAL’s Orrice, War 
DeparTMENT. PRS Report No. 691. Characteristics of Good and Poor Army 
Nurses Compiled from Essays Written by Medical Officers, Supervisory Nurses, 
General Duty Nurses and Patients. May 1946. 

. Personnet Research Section (Starr), ApyutaNnt Genera’s Orrice, War 
DeparTMENT. PRS Report No. 692. Development and Use of Army Trade 
Screening Tests in ASF, March 1946. July 1946. 

. Personnet Researcn Section (SrarF), Apyutant GeENeRAL’s Orrice, War 
DeparTMENT. PRS Report No. 692a. Use of Army Trade Screening Tests to 
Evaluate the Effectiveness of Training in ASF Training Centers. Supplement I. 
March 1946. 

. Personnet Researcn Section (Starr), ApyutaNnt GENeERAL’s Orrice, War 
DeParRTMENT. PRS Report No. 692b. Use of Army Trade Screening Tests to 
Evaluate the Effectiveness of Training in ASF Training Centers. Supplement 
II. March 1946. 

. Personnet Researcn Section (Starr), Apyutant GeENeERAL’s OrFice, WAR 
DeparTMENT. PRS Report No. 693. Development of Instruments for Selection 
of Enlisted Personnel for Recruiting Work. July 1946. 

. Personnet Researcn Section (Starr), Apyutant GenerAv’s Orrice, War 
DEPARTMENT. PRS Report No. 694. Performance and Written Tests and Per- 
sonal Data Factors as Predictors of Supervisory Ratings of Competence of 
Specialists in AAF Fighter Combat Units in Continental U. S. September 1943. 

. Personner Researcn Section (Starr), Apyurant Generat’s Orrice, War 
Department. PRS Report No. 694a. Supplement to Report on Performance and 
Written Tests and Personal Data Factors as Predictors of Supervisory Ratings 
of Competence of Specialists in AAF Fighter Combat Units in Continental 
United States. August 1946. 

. Personner Researca Section (Starr), Apyutant Generaw’s Orrice, War 
DeparTMEeNT. PRS Report No. 695. Correlational Analysis of Sixteen Tests 
(Arlington Hall Factor Analysis Study). July 1945. 

. Personnet Researcn Section (Starr), Apyutant GENERAL’s Orrice, WAR 
Department. PRS Report No. 697. Validation of General Clerical Abilities 
Test, CC-105a, and Certain Other Tests of Clerical Aptitude. February 1946. 

. Personnet Researcn Section (Starr), Apyutant Generaw’s Orrice, War 
DeparTMENT. PRS Report No. 700. Item Analysis of the Multiphasic Personality 
Inventory, Based on Data Collected at Camp Stewart under Project PR-4030. 
June 1945, 
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295. PersoNNEL ResearcnH Section (Starr), ApJuTANT GENERAL’s OFFICE, War 
DEPARTMENT. PRS Report No. 701. Methodological Investigation of the Forced 
Choice Technique, Utilizing the Officer Description and Officer Evaluation, 
Blanks. July 1945. 

296. PERSONNEL ResearcH Section (Starr), ApsuTANt GENERAL’s OFFICE, War 
DepaRTMENT. PRS Report No. 702. Obtaining Officer Preference and Officer 
Characteristics Scale Values of Adjective for Use in Construction of Items for 
the Biographical Information Blank, PR-4061-02. July 1945. 

297. PersonNEL ResearcH Section (Starr), ApsuTANT GENERAL’s OFFICE, War 
DEPARTMENT. PRS Report No. 703. Construction and Selection of Items for 
the Biographical Information Blank (BIB). July 1945. 

298. PERSONNEL ReseaRcH Section (Starr), ApJUTANT GENERAL’s OFFICE, War 
DeparTMENT. PRS Report No. 704. Validation of a Program for Selection oj 
Officers for Retention in the Peacetime Army. July 1945. 

299. PersONNEL ReseaRrcH Section (Starr), ApJuTANT GENERAL’s OFFICE, War 
DEPARTMENT. PRS Report No. 705. Development of an Interview Procedure for 
Use in the Officer Selection Procedures, PR-4061-09 and 4061-10. July 1945. 

300. PersonNEL ResearcH Section (SrarrF), ApsuTANT GENERAL’s OFFICE, War 
DEPARTMENT. PRS Report No. 706. Characteristics of Successful and Unsuc- 
cessful Officers Studied for the Development of Officer Evaluation and Report. 
ing Forms, PR-4061-08. August 1945. 

301. Personnet Researcu Section (Starr), ApJuTANT GENERAL’s OrFrice, War 
DEPARTMENT. PRS Report No. 707. Analysis of Rating Made with the WD AGO 
Form 67, Efficiency Report. July 1945. 

302. PersonNEL ResearcH Section (Starr), ApyutTaANtT GENERAL’s OFFICE, War 
DEPARTMENT. PRS Report No. 708. Analysis of Ratings of Air Force Officers 
on AAF Form No. 123, Officer Evaluation Report. July 1945. 

303. PersonneL ResearcH Section (Starr), Apyutant GENERAL’s OFFice, War 
DEPARTMENT. PRS Report No. 711. Predictions of Leadership Qualifications 
of Officer Candidates in the Signal Corps, PR-4071b. March 1946. 

304. PersonneL ResearcnH Section (Starr), ApyuTANt GENERAL’s OFFIce, War 
DeparTMENT. PRS Report No. 71la. Prediction of Tactical Performance of Off- 
cer Candidates in Signal Corps, Supplement to Report and Recommendations, 
PR-4071b. March 1946. 

305. PersonNeEL ResearcH Section (Starr), ApyuTaNt GENERAL’s OFFICE, War 
DeparTMENT. PRS Report No. 712. Officer Retention Project Equivalent Scales 
for the Two Forms of Officer Classification Test, OCT-14A and OCT-14B. 
June 1945, PR-4061. June 1945. 

306. PersonNeEL ResearcuH Section (Starr), ApyuTant GENERAL’s OrrFice, War 
DEPARTMENT. PRS Report No. 713. Development of the General Survey Test, 
Camp Blanding, PR-4061. May 1945. 

307. PersonNEL Researcnh Section (Starr), ApyutaANt GENERAL’s Orrice, War 
DEPARTMENT. PRS Report No. 714. Validation of Officer Classification Test, 
OCT-14, as a Predictor of Grades at the Command and General Staff School, 
Fort Leavenworth, Kansas. 

308. PersonNeL Researcu Section (Starr), ApyutANt GENERAL’s Orrice, War 
DeparTMENT. PRS Report No. 715. Possibility of Predicting Proper Classifica- 
tion of Officer on Basis of Differential Scoring of FCL-2a Items, Part II (Most- 
Least), PR-4073. April 1946. 

309. PersonNEL Researcnw Section (Starr), ApyuTaANt GENERAL’s OrFice, War 
DEPARTMENT. PRS Report No. 717. Comparison of Rating Check List (RCL) 
as Forced Choice List (FCL) Methods of Obtaining Ratings, September 1945, 

-4073. 

310. PersonneL ResearcuH Section (Starr), ApyutaNt GENERAL’s OFFICE, WAR 
DeparTMENT. PRS Report No. 718. The Development and Evaluation of Classifi- 
cation Tests R-3 and R-4. June 1946. 

311. Personne Researcu Section (Starr), ApyutTaANt GENERAL’s OFFice, WAR 
DEPARTMENT. PRS Report No. 722. Data Concerning Possible Cut-Off Scores 
~ the General Survey Test for the 2nd Officer Integration Program, PR-4096. 
uly 1946. 

312. Personne Researcn Section (Starr), ApyuTANt GENERAL’s OrFice, WAR 
DEPARTMENT. PRS Report No. 723. Development of Predictor Instruments Used 
in Study of Selection of Candidates for Officer Training, PR-4076. August 1946. 
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313. PERSONNEL Research Section (Starr), ApJuTANT GENERAL’s OrFice, WAR 
DEPARTMENT. PRS Report No. 724. The Relationship of Army Individual Test 
Subscores and Other Mental Ability Tests to Diagnosis of Mental Disorder. 
June 1945. 

. PersoNNEL Researcn Section (Starr), ADJUTANT GENERAL’s Orrice, WAR 
DeparTMENT. PRS Report No. 726. Analysis of Item Responses of W AC's and 
WAC Applicants on the Multiphasic Personality Inventory (TC-8a) and the 
Cornell Selectee Index Administered at Grand Central Palace, New York City. 
May and June 1944. 

. PERSONNEL Researcu Section (Starr), ADJUTANT GENERAL’s OFFice, WAR 
DEPARTMENT. PRS Report No. 727. Construction of Biographical Information 
Blanks NSB-1 and NSB-2 for Nurses and Women Medical Specialists. Septem- 
ber 1944. 

. PersoONNEL Researcw Section (Starr), ApJuTANT GENERAL’s OFFICE, WAR 
DeparTMENT. PRS Report No. 728. Construction of Army Nurse Evaluation 
Report Form NSE-1B and Army Nurse Evaluation Report Supplement Form 
NSE-1Bs. October 1946. 

. PersoONNEL Researcn Section (Starr), ADJUTANT GENERAL’s Orrice, WAR 
DeparTMENT. PRS Report No. 801. Validation of the General Clerical Abilities 
Test, CC-105a, as a Selection Instrument for the Position of File Clerk, CAF-2, 
Decorations and Awards Branch, AGO. March 1946. 

. PersONNEL Researcn Section (Starr), ADJUTANT GENERAL’s Orrice, WAR 
DepaRTMENT. PRS Report No. 1000. The Determination of a Qualifying Score 
on Army Specialized Training Test, OCT-2, X-3 for Selection of Men for the 
Army Specialized Training Program. January 1943. 

. PERSONNEL Researcu Section (Starr), ApyuTaANt GeENERAL’s Orrice, WAR 
DepaRTMENT. PRS Report No. 1001. Selection of Officer Candidates: Validity 
of Officer Candidate Test OCT-2, X-3 for Predicting Academic Averages of the 
West Point 1943 Fourth Class. March 1943. 

. PERSONNEL Researcn Section (Starr), ApJuTANT GENERAL’s Orrice, WAR 
DEPARTMENT. PRS Report No. 1004. Standardization of United States Army 
and Navy Test C-1 for Civilian Candidates for the Army Specialized Training 
Program. April 1943. 

. Personnet Research Section (Starr), ApyuTant GeENERAL’s Orrice, WAR 
DeparRTMENT. PRS Report No. 1009. Prediction of Success in the First Term 
Basic Engineering Curriculum at Syracuse University. September 1943. 

. PersoNNEL Researcu Section (Starr), ApyutaANt GENERAL’s Orrice, War 
DeparTMENT. PRS Report No. 1020a. AST Achievement Test Report: Decem- 
ber 1943, Standardization Testing. February 1944. 

23. PersONNEL Researcu Section (Starr), ApsuTaANt GENERAL’s Orrice, War 
DepartTMENT. PRS Report No. 1025. Standardization of Army-Navy Qualifying 
Test C-2 Administered. November 1943. 

. Personne. ResearcH Section (Starr), ApyJuTANT GENERAL’s Orrice, WAR 
DepaRTMENT. PRS Report No. 1026. Prediction of Success in the ASTP Basic 
Engineering-1, Term 1 Curriculum at City College of New York. June 1944. 

. Personne, Researcu Section (Starr), ApyuTANtT GENERAL’s Orrice, WAR 
Department. PRS Report No. 1027. Relation of the Aptitude Test for the 
Medical Professions, Form 20, First Edition, to the Army General Classification 
Test. February 1944. 

. Personne. Researcnu Section (Starr), ApyuTaAnt GENERAL’s OrrFice, WAR 
DeparTMeNT. PRS Report No. 1027a. Relation of the Aptitude Test for the 
Medical Professions, Form 20, First Edition, to the Army General Classifica- 
tion Test, for Candidates (a) Preferring Pre-Medical Training, (b) Preferring 
Pre-Dental Training and (c) Not Interested in Pre-Medical or Pre-Dental 
Training. 

. Personne: Researcn Section (Starr), Apyutant GeNERAL’s Orrice, WAR 
DeparTMENT. PRS Report No. 1028. Item Analysis of January 1944 ASTP 
Experimental Tests. February 1944. 

. Personne: Researcn Section (Starr), ApyuTaNnt GENERAL’s Orrice, War 
DepartTMENT. PRS Report No. 1030. Results of the Administration of the Apti- 


tude Test (Form 20) for the Medical Professions to the ASTP Trainees. March 
1944. 
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DeParTMENT. PRS Report No. 1031. Results of the Administration of the Api. 
tude Test (Form 21) for the Medical Professions to ASTP Trainees. March 1944 

. Personnet Researcn Section (Starr), Apsutant GENERAL’s OFFICE, Way 
DEPARTMENT. PRS Report No. 1032. Some Preliminary Investigation into the 
Relationships of the Aptitude Test for the Medical Professions to other Edy. 
cational and Personal Variables. March 1944. 

. PersonneL Researcn Section (Starr), ApyutaAnt GENERAL’s OFFICE, Wa, 
DEPARTMENT. PRS Report No. 1034. Standardization of Army-Navy College 
Qualifying Test C-3. April 1944. 

. PersonneL Researcnu Section (Starr), ApJuTANT GENERAL’s OFFICE, War 
DeparTMENT. PRS Report No. 1036. Comparison of AST Achievement Tes: 

Results in the December 1943 Standardization with Results in the January 1944 
Population. April 1944, 

PersoNNEL Research Section (Starr), ApyutaNt GENERAL’s OFFICE, War 
DEPARTMENT. PRS Report No. 1037. The Effect of Scoring Formula Upon the 
Reliability of AST Achievement Tests. April 1944. 

PERSONNEL ResearcH Section (Starr), ApJuTANT GENERAL’s OFFice, War 
ogee PRS Report No. 1041. The Relation between Specialized Training 

est. 

PersONNEL ResearcH Section (Starr), ADJUTANT GENERAL’s OFFICE, War 
DepartTMENT. PRS Report No. 1049. Results of Experimental Study of F fects 
of Directions Against Guessing and of Corrections for Guessing on Scores on 
ASTP Contract Tests, State University of lowa. 

. PersonNEL Research Section (Starr), ApyutaNnrt GENERAL’s OrFice, War 
DEPARTMENT. PRS Report No. 1050. Comparison of Prediction of Success in 
Terms I and II ASTP Basic Engineering Curriculum at Syracuse University. 
June 1944, 

PersONNEL ResearcH Section (Starr), ApyutaNt GENERAL’s OFFice, War 
DeparTMENT. PRS Report No. 1051. The Relationship Between Formal Item 
Analysis and Reliability on AST Achievement Tests (October 1943 Regular, 
October 1943 Experimental, and December 1943 Standardization Tests). May 
1944. 

PeRSONNEL Researcu Section (Starr), ApyutTaANtT GENERAL’s OFFIce, War 
DEPARTMENT. PRS Report No. 1052. The Validity of the AGCT, American Coun- 
cil on Education Psychological Examination (1942 edition), Army Specialized 
Training Test OCT-2, X-3, and the WPQ-1, X-1 Language Aptitude Test as 
Predictors of Success in the ASTP Language Curricula at the College of the 
City of New York, Syracuse University, Boston University, and Michigan State 
College. June 1944, 

PersonNneL Researcu Section (Starr), ApJuTANT GENERAL’s OrFice, War 
DEPARTMENT. PRS Report No. 1053. Survey of the Socio-Economic Level and 
Post-War Educational Plans of Approximately 8000 Enlisted Men Assigned 
ad the ASTP Basic Engineering-1 Curriculum at 21 Training Centers. Septem- 

r 1944, 


. Personne Researcu Section (Starr), Apyutant GENERAL’s Orrice, War 


DepARTMENT. PRS Report No. 1066. Validation of the Mathematics Inventory, 
A510-2, Arl as a Predictor of Success in Term 1 of the Introductory and Basic 
ASTR Curricula. March 1945. 

PersONNEL ResearcH Section (Starr), ApyutANt GENERAL’s OrrFice, War 
DepartMEeNT. PRS Report No. 1067. Standardization of ASTRP Qualifying 
Test C-4. April 1945. 
CHAPTER V


Research For or By the Armed Forces 


JOHN C. FLANAGAN and DOROTHY B. BERGER 


Selection Research on Mental Adjustment 


Research in the field of mental adjustment in the Armed Forces consisted
of (a) selection studies which used tests mainly of the paper-and-pencil
variety to appraise the emotional stability of servicemen and (b) studies
which attempted to make the same assessment on the basis of psychiatric
interviews, situation tests, and other screening procedures.

The use of the Rorschach in evaluating military personnel was discussed 
by Linn (134) in a study in which Rorschach records obtained from a 
group of enlisted men assigned to a hospital were compared with per- 
formance ratings a year later after eight months of overseas duty. Re- 
sponses given by well-adjusted soldiers were markedly different in many 
respects from norms based on well-adjusted civilians. The hypothesis was 
advanced that personality constriction and regression were produced by 
military indoctrination. 

A group of papers on the construction, standardization, application, 
and results of research on the Cornell Indices and the Cornell Word Form
included a report by Mittelmann and Brodman (159) suggesting that the
Cornell Service Index, Selectee Index, and Word Form were designed to 
differentiate quantitatively individuals with personality and psychosomatic 
disorders and to facilitate qualitative diagnosis of these disorders. A report 
by Weider and Wechsler (261) discussed the results of the application of 
the Cornell Indices and Word Form, the criteria of significant answers, and 
validity data. Wolff (266) indicated that with their basis of clinical expe- 
rience and psychological and psychiatric principles the Cornell instru- 
ments might be used at induction stations, clinics, neuropsychiatric wards, 
and medical and surgical wards in hospitals, or in industry, veteran place- 
ment, research, and hospitals and clinics in civilian work. Harris (78) 
discussed the use of the Cornell Selectee Index as an aid and timesaver in 
the psychiatric diagnosis of naval personnel. 

The Personal Inventory was discussed by Shipley and Graham (199) 
who presented a report and complete bibliography of the work on that and 
other tests of emotional stability. Satter (188) reported a study in which 
the results of the Personal Inventory were compared with success and 
failure in parachute school. Satter (187) also discussed the inability of 
the Personal Inventory, the Otis Tests of Mental Ability, the Two-Hand 
Coordination, several other tests, and psychiatrists’ evaluations to predict 
officers’ ratings of enlisted men in the submarine service. Shipley, Gray, 
and Newbert (200) found that the Personal Inventory differentiated be- 
tween discharges from the Navy and men still active in the service after 
one year, between good and bad conduct cases, and between those rated
and those not rated. Berry, Leavitt, and Mote (22) compared formats A
and B of the Inventory and found them to be parallel.

Wexler (215, Chapter 9) discussed the measures of personal adjustment 
developed and used in the Bureau of Naval Personnel and Watson (257) 
the prognostic value of the psychological tests in the Navy Officer Training 
Program. 

Selection on the basis of neuropsychiatric screening was described by 
Southworth (211) who presented data on rejections and factors influencing 
rejections at the Great Lakes Naval Training Station. Newman, Bobbitt,
and Cameron (162) reported on a reliability study of an interview by two 
psychologists and one psychiatrist for the evaluation of officers in the 
Coast Guard. Biserial correlation coefficients were reported for failures 
to graduate from Submarine School for four tests developed thru the use 
of psychiatric criteria by Bartlett (16) along with evidence showing the 
relationship between clinical evaluations and school failure. 
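
As an illustration of the statistic named above, the following is a minimal sketch of a biserial correlation between a continuous test score and a pass-fail outcome such as graduation. The scores and outcomes are hypothetical and are not taken from Bartlett (16); the function name is likewise an assumption made for the example. The sketch uses the usual textbook form, the difference between the two group means divided by the standard deviation of all scores, multiplied by pq/y, where y is the ordinate of the unit normal curve at the point of dichotomy.

```python
from statistics import NormalDist, mean, pstdev


def biserial_r(scores, passed):
    """Biserial correlation between test scores and a pass/fail outcome."""
    graduates = [s for s, ok in zip(scores, passed) if ok]
    failures = [s for s, ok in zip(scores, passed) if not ok]
    p = len(graduates) / len(scores)   # proportion graduating
    q = 1.0 - p                        # proportion failing
    unit_normal = NormalDist()
    # Ordinate of the unit normal curve at the point cutting off proportion p.
    y = unit_normal.pdf(unit_normal.inv_cdf(p))
    return (mean(graduates) - mean(failures)) / pstdev(scores) * (p * q / y)


# Hypothetical data: ten trainees, seven of whom graduated.
scores = [52, 61, 47, 70, 65, 58, 43, 74, 55, 68]
passed = [True, True, False, True, True, True, False, True, False, True]
r_bis = biserial_r(scores, passed)
print(f"biserial r = {r_bis:.2f}")
```

The biserial coefficient assumes that a normally distributed trait underlies the dichotomy; with small or sharply separated groups such as the toy data here it can approach or even exceed 1.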

The psychobiological screening procedures in the War Shipping
Administration were discussed by Killinger and Zubin (102) who pointed
out that the screen caught 85 percent of those who would eventually have 
to be disenrolled. The effectiveness of battle-noise equipment as a test 
for emotional stability was evaluated by Hartley and Jones (79). 

The selection of workers for strategic services was described by the Office 
of Strategic Services Assessment Staff (229). In addition to a vocabulary 
test, a sentence completion test, a health questionnaire, a work conditions 
questionnaire, and a personal history form, the process included several 
outdoor tests such as the Brook Test, the Wall-Scaling Test, the Construction 
Test, and several paper-and-pencil tests such as the Map Memory Test, the 
Bennett Mechanical Comprehension Test, and the Manchuria Test of Propa- 
ganda Skills. Murray and MacKinnon (160) pointed out that altho no 
follow-up has been completed, only one of the 300 cases selected by the 
OSS staff failed because of a neuropsychiatric condition. 

Steinberg and Wittman (212) discussed a study of the sociological, per- 
sonality, and adjustment characteristics of hospital patients who sup- 
posedly broke under camp life, of veterans in a mental hospital, and of 
well-adjusted soldiers. A study of the interests of Marine Corps women 
as measured by the Kuder Preference Record was reported by Hahn and 
Williams (77). Adams and Fowler (1) presented a report on the reliability 
of two forms of an activity preference blank used to select fire controlmen. 


Selection Research on Intelligence 


Research in the selection of personnel on the basis of intelligence in- 
cluded a group of studies on the Wechsler-Bellevue Test. The value of five 
of the subtests of the Bellevue verbal scale in differentiating among nor- 
mal, dull-normal, borderline, and mentally deficient groups in the exami- 
nations of naval recruits was discussed by Lewinski (113). Altus (7) con- 
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cluded, from a study of the validity of the Wechsler-Bellevue, that the 
validity is somewhat higher for the total scale than for Form B of the 
scale. Hunt (86) considered the use of a ten-minute individual test of 
intelligence, the Wechsler-Bellevue, reading and language handicap tests, 
the Rorschach, and various educational tests for selecting naval recruits. 
Correlations between the original and revised Kent Emergency Scales, and 
between the Kent and the Stanford-Binet, and the Wechsler-Bellevue were 
discussed by Lewinski (114). Greenwood, Snider, and Senti (69) described 
a study of the correlation between the Wechsler Mental Ability Scale, Form 
B, and the Kent Emergency Test administered to 200 Army personnel. A 
correlation of .74 ± .02 was found between the two tests and it was con- 
cluded that the Kent is suitable for intelligence testing in situations not 
permitting more extensive testing. 
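
The margin of .02 attached to the correlation of .74 is consistent with the probable error convention widely used for correlation coefficients in this period, altho the review does not state which statistic was reported. The sketch below assumes that convention; the function name is illustrative only and is not taken from the study.

```python
import math

def probable_error_of_r(r: float, n: int) -> float:
    """Classical probable error of a Pearson r: 0.6745 * (1 - r^2) / sqrt(n).

    Assumption: the margin reported with the correlation is a probable
    error, not a standard error; the source does not say which.
    """
    return 0.6745 * (1.0 - r * r) / math.sqrt(n)

# Values reported in the text: r = .74 for 200 Army personnel.
pe = probable_error_of_r(0.74, 200)
print(round(pe, 2))  # 0.02
```

Under this assumption the computed value rounds to the .02 given in the text.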

A study was presented by Lindsley (121) which indicated that students 
with an Otis Intelligence Test score of —1 or less would fail in the filter 
course at Camp Murphy, Florida. Colmen (33) described a five-minute 
group test which was found to be adequate and reliable for measuring 
intelligence without being influenced by illiteracy. 


Research in the Selection of Officers 


Jensen and Rotter (91) reported that of thirteen psychological tests 
investigated as screening instruments, the most efficient combination for 
predicting academic success was the Personnel Test (Wonderlic modifica- 
tion of the Otis Higher Examination), the Arithmetic Computation (Stan- 
ford Achievement Test, Advanced), and the Combined Paragraph- and 
Word-Meaning sections of the Stanford Achievement Tests. 

A program consisting of an interview in which past history rating was 
obtained, a standardized life-like construction test which yielded ratings 
on seven basic traits related to combat leadership ability, a specially 
devised sensori-motor test, a rapid projection test, and the group use of the 
TAT, was discussed by Murray and Stein (161). 

An analysis of the records of two classes at Fort Sill Artillery School 
by Garrett and Ligon (67) revealed that combat efficiency was not very 
closely related to ratings for leadership obtained in OCS, that there was 
some indication that the best officers came from age range 22-28 and that 
above a certain desirable minimum, intelligence as measured by the GCT 
had little relevance to combat performance. 

A group of studies on the selection of officers for the Navy included re- 
ports by Cornehlsen (38, 39, 215) on the growth of the selection and 
classification program for officers and for reserve officers for billets; by 
Miller and Owens (215) on the Basic Tests for Officer Personnel; by Fred- 
eriksen and Peterson (65) on the development and validity of the Navy’s 
Officers’ Qualification Test; by Gulliksen (70) outlining the specifications 
for an Officers’ Selection Test; by Frederiksen (58) on the comparison 
of the Officers’ Qualification Test, Form 1, and the U. S. Navy Aptitude 
Test, Form E-2; by Gulliksen (72) on the preparation of Form 1 of the 
Navy Officer Qualification Test showing that the revised test had a wide 
range of difficulty and a satisfactory reliability; by Frederiksen (61) on 
the preparation of norms for the Officer Qualification Test, Form 1; by 
Gulliksen (71) on the preparation of norms for women on the Officer 
Qualification Test, Form 1; by Peterson (172) who gave a statistical evalua- 
tion of the Navy Officer Qualification Test, Forms 2 and 3; and by Frederik- 
sen (60) who discussed the preparation of a spatial relations test, consisting 
of multiple-choice items concerning the rotation of solid forms, for selecting 
radar officers. Conrad and Lannholm (215) described the prediction of 
success in Primary Officer Training School and Maucker (215) described 
such prediction in Advanced Officer Training School. 


Research in the Selection of Enlisted Personnel 


The selection of enlisted men in the Navy was discussed in a group of 
studies. Odell (215) gave an account of the growth and development of 
the selection, qualification, and classification programs. Bond and Miller 
(215) described the development of the Basic Test Battery. The Staff of 
the Bureau of Naval Personnel (219) presented studies on item analyses, 
time limits, reliabilities, norm development, validity, intercorrelations, and 
factor analyses of the Basic Test Battery. The Staff of the Bureau (221) 
also discussed the validity of Form 1 of the Basic Test Battery for selection 
for two types of elementary training schools. Bloom and Brundage (215) 
described the prediction of success in elementary enlisted schools. Curtis 
(215) described prediction in the advanced schools. Satter (187) found 
that there was no relationship between the scores on the Otis Higher Ex- 
amination, the Personal Inventory, and the Two-Hand Coordination Test 
and officers’ ratings of submarine crewmen on the job. Graham, Mote, and 
Berry (68) found that the same battery predicted “tank escape” perform- 
ance failures considerably better than chance for submarine crewmen. 
Miller (157) reported a study on the choice of a test battery for selection 
of LCVP coxswains thru the use of the Wherry-Doolittle test selection 
method. Miller (156) also described reliability studies of six apparatus 
tests used with the Navy Basic Battery for the selection of LCVP coxswains, 
and in another study (154) reported that the Navy finally chose a hand 
dynamometer and a pegboard test from this group of mechanical tests 
on the basis of correlations with ratings on boat handling. 


Research in Selection of Enlisted Personnel 
for Particular Jobs 


Studies concerning the selection of fire controlmen and radar operators 
included studies of vision such as the one by Adams, Fowler, and Imus 
(3) which discussed the relationship between visual acuity and acuity of 
stereoscopic vision. Adams and others (4) found that the Ortho Rater 
Tests were sufficiently reliable as testing devices in the selection of candi- 
dates for training as fire controlmen and range finder operators. The inter- 
relationships among seven tests of stereoscopic acuity and the relationship 
between two tests of visual acuity and two tests of Phorias were pointed 
out by Fowler, Imus, and Mote (56). The battery used in the selection of 
fire controlmen, range finders, and radar operators was discussed by Beier 
and others (21). Imus (88) presented the directions, procedures, tests, 
and equipment used in the selection and classification of fire controlmen. 
The final report on the Selection and Training of Radar Operators was 
made by Lindsley (117). 

The Staff of the Bureau of Naval Personnel (222) described the con- 
struction of selection and achievement examinations and the conduct of 
technical personnel research designed to facilitate the selection and train- 
ing of personnel in the maintenance and repair of electronic equipment. 
The predictive efficiency of the Navy Basic Test Battery at gunner’s mates 
school was discussed by the staff of the Bureau of Naval Personnel (226). 

Methods of selecting naval gun and engineering crews were discussed 
in a complete summary and bibliography of the Gunnery Project of the 
Applied Psychology Panel by Viteles, Gorsuch, and Wickens (242). 
Rogers, Viteles, and Voss (180) presented similar material for the Ap- 
plied Psychology Panel Engineering crew project. 

McQuitty (139) described the personnel selection program at an Engi- 
neering Replacement Training Center which was continuous, being co- 
ordinated with the training program, and was based upon the AGCT score, 
formal education, and first and second best civilian occupations. Selection 
for specialists’ courses was on the basis of interest, success in a specialty, 
related hobbies, educational background, and aptitudes. 

The work on the selection and training of night lookouts, including a 
discussion of validation of old night-vision tests, of the measurement of 
the performance of night lookouts at sea, and of an analysis of the night 
lookout’s job, was summarized by Wedell (259). In a study of hearing 
in searchlight and other personnel requiring exceptionally good hearing, 
Clarke (31) reported that audiometers were unsuited for mass testing of 
acuity of hearing and suggested that gramophone records with words of 
varying intensities be employed. The study reported that only 40 percent 
of the 100 soldiers trained as listeners or spotters appeared to have, for 
both ears, hearing superior to three decibels loss. The best combination 
of tests for selecting Army weather observer students was reported by 
Cleveland, Faubion, and Harrell (32) to be a mental alertness test, a 
physics achievement test, and a meteorological achievement test. 

Kurtz, Seashore, and Willits (107, 105) presented a discussion of the 
Code Receiving Tests developed by the Applied Psychology Panel. 

Reid (178) reported that in aptitude tests for drivers in the Third 
Armored Division, the greatest number of failures occurred for glare blind- 
ness, defective acuity, and depth perception. A yarn test for color vision, 
a field of vision test, and tests of depth perception, glare, balance, stability, 
reaction time, and visual acuity were given to 10,000 prospective heavy- 
truck drivers in this study. The selection and training of cargo handling 
teams for combat-laden vessels was discussed by Ruch (185). 

Selection tests and causes for rejection were described by Thomas (224) 
in a report on the selection of parachutists. 

A report on the construction of various performance tests, group and 
individual, and of checklists for objective observation in the Teacher 
Training Department of the Armored Force School, Fort Knox, was pre- 
sented by Siro (202). 


Research on Classification and Aptitude Tests 


Research in classification in the services included studies such as those 
discussed by McCain and Schneidler (135) concerning the classification and 
selection of enlisted personnel. Eurich and McCain (50) also described 
the initial classification in which each recruit was given the general classifi- 
cation, reading, arithmetic reasoning, mechanical aptitude, clerical and 
mechanical knowledge tests, an interview, and then a recommendation for 
two possible jobs. The specifications for the construction of a general 
classification test, a reading test and a test of arithmetical reasoning were 
outlined by Frederiksen (62). The selection of items and a comparison 
of the selected items with those previously used for the General Classifica- 
tion Test, Form 2, and for the tests of reading and arithmetic was described 
by Satter (189). Wrenn (270) included in his description of Navy per- 
sonnel procedures the nature of the classification interview and the train- 
ing of interviewers. The derivation of national norms for the Fleet Edition 
of the General Classification Test was described by Peterson (171) who 
concluded from the data collected that the GCT (x-l-s) served satisfac- 
torily as a self-administering test and constituted a parallel form of the 
GCT, Form 1. He (170) also reported on a factor analysis of the new 
Navy Basic Classification Test Battery. A statistical evaluation of the Basic 
Classification Test Battery, Form 1, led Conrad (35) to the conclusion 
that the battery competently fulfils the essential requirements. 

Procedures and tests used to select men for assignment to fill the balance 
crews for newly commissioned destroyers on the Pacific Coast were dis- 
cussed by Levin (111). The work of the Classification Section of the 
Armored Force Replacement Training Center described by Wittman (264) 
included material on the psychological, clerical, and mechanical-aptitude 
testing; on occupational interviews, testing and classification; on assign- 
ment to different military duties; on the selection of officer training candi- 
dates; on the liaison relationships with regular training companies; on the 
record keeping and planning of activity flow; and on research and selection 
of men for the Special Training Unit which handles and studies physical, 
mental, and psychological problems. Malone (142) described the Army 
Classification system, the use of the AGCT, the Mechanical Aptitude Test, 
and the Radio Operators Aptitude Test. An activity preference test furnish- 
ing a number of scores corresponding to clusters of functionally related 
activities was reported by Kelley (100) who outlined the steps leading to 
its development. The selection of personnel with superior vision for the 
crew of the USS New Jersey was described by Verplanck (231) in a study 
using an NDRC Adaptometer Model II. 

Research on aptitude tests for the Navy was reported by Frederiksen 
(63) who described a study made on an experimental battery of aptitude 
tests as predictors of service-school grades for inclusion in the Basic 
Classification Test Battery; by Conrad (36) who presented the research 
and developmental history of the Navy’s aptitude testing program; by 
Conrad (34) who also discussed the basic statistical facts concerning indi- 
vidual items of the Navy Aptitude Tests and interpreted these facts with 
reference to various problems; by Stuit and Feder (215) who described 
the development of special aptitude tests by the Bureau of Naval Personnel; 
by Gulliksen, Conrad, and Frederiksen (75) who confirmed an earlier 
conclusion, by studying the averages, standard deviations, and intercorre- 
lations of the Navy aptitude tests, that variations in procedure from one 
station to another constitute a serious problem; by Gulliksen (73) who 
compared the selection of test items for a mechanical comprehension test 
by an item analysis based on an external criterion and by the technic of 
item-total correlation and also (74) presented minor modifications which 
could be made in a short time to the Navy Mechanical Aptitude Test, 
Form T, and made suggestions for a more thoro revision of the test. 
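
The two item-selection approaches contrasted in the Gulliksen (73) study can be illustrated with a small sketch: each item is correlated once with the total test score (the internal index) and once with an outside criterion such as a school grade (the external index). The data, the function names, and the use of an uncorrected total are assumptions made for illustration only and do not reproduce the procedures or results of the study.

```python
import math

def pearson(x, y):
    """Plain Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def item_indices(responses, criterion):
    """Return (item-total r, item-criterion r) for every item.

    responses: one row per examinee, 1 = correct, 0 = wrong.
    criterion: one external score per examinee (hypothetical grades here).
    The total includes the item itself, i.e. it is uncorrected.
    """
    totals = [sum(row) for row in responses]
    indices = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        indices.append((pearson(item, totals), pearson(item, criterion)))
    return indices

# Invented toy data: six examinees, three items, one criterion score each.
responses = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1], [0, 1, 0], [1, 1, 1]]
criterion = [70, 65, 90, 50, 55, 85]

for number, (r_total, r_criterion) in enumerate(item_indices(responses, criterion), 1):
    print(f"item {number}: item-total r = {r_total:.2f}, item-criterion r = {r_criterion:.2f}")
```

Items may rank differently under the two indices, which is the kind of difference a comparison of the two methods has to examine.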

Validation research was presented by Frederiksen (64, 63) in a dis- 
cussion of the validities of aptitude tests at various schools; by Crawford 
and Burnham (45) in a report on the results of the educational aptitude 
testing of V-12 students in which it was found that aptitude tests proved 
to be effective predictors of academic work measured by objective achieve- 
ment tests; and by Anderson and others (10) in a paper on the Oscilloscope 
Operator tests. Prediction of ability was discussed by Kurtz (104) in 
relation to code learning. Prediction of success in Electricians’ Mates 
School was presented by Conrad and Satter (37) in a discussion of the 
use of test scores and quality classification ratings. A report was given by 
Smith and Voss (206) on a study of the effectiveness of the classification 
procedures for officers of the Amphibious Training Command. Smith and 
others (207) also reported on the effectiveness of classification data in 
predicting billet performance in training in the Amphibious Force. Predic- 
tion of success in service school from the order of assignment was discussed 
by Satter and Conrad (190). A study presented by Wedell (260) reported 
on the prediction of the performance of night lookouts. 

Procedures used in a job analysis of the tasks performed at gun stations 
were enumerated by Viteles and Smith (243). The final report and sum- 
mary of work in job analysis qualification and placement of personnel in 
the Amphibious Force was presented by Smith (205). 

A selectometer for weighting the qualities on which interviewers rate 
men was discussed by Keislar (93) and by Campbell (30) who made a 
final report on research and development of classification aids. Another 
classification aid, the point-score method for evaluating Naval personnel 
was presented by Levin (112). Viteles (233) presented an interviewer's 
recommendation chart which shows in visual form the billets aboard ship 
for which an individual with given qualifications is most adapted. The 
personal preference technic, which employs the opinions of co-workers or 
students who have had adequate time to observe their fellows, was dis- 
cussed by Wiggin and Bartlett (263) as a possible supplement to in- 
structor’s grades. 


Research on Training 


In an article on military training and learning theory, Wolfle (267) 
indicated that help was given to military specialists in World War II by 
psychologists who applied such principles of learning as distribution of 
practice, active participation, variation of material, accurate records of 
progress, knowledge of results, and systematic lesson plans. Applications 
of these theories appeared in some of the work discussed below. 

Gunnery-training research included such studies as the ones discussed 
by Viteles et al (235, 245, 246, 249, 251, 252, 253), outlining training 
aids, lesson plans, and courses of instruction for a four-day course in 
20 mm and 40 mm gunnery; by Viteles (234) who investigated the 
scoring characteristics of the Machine Gun Trainer, Mark 1; by Smith et al 
(208) presenting a memorandum on gunnery teaching; by Covner and 
Viteles (40) presenting instruction in engineering, damage control, and 
gunnery at the CVE Precommissioning School; and by Viteles, Gorsuch, 
and Wickens (241) describing the standardized four-day courses using 
unit lesson plans for a gunnery-training program, and the study of syn- 
thetic training devices. Range-estimation studies included Wickens, Gor- 
such, and Viteles’ (262) account of lesson plans for instruction on the 
mirror Range Estimation Trainer Device 5C-4; Voss and Wickens’ (255) 
comparison of free and stadiametric estimation of opening range; Horo- 
witz and Kappauf’s (84) description of the accuracy of unaided visual 
range estimation for aerial targets at ranges between 1500 and 8000 
yards; Viteles et al’s (248) analysis of the results obtained in training 
men in range estimation on the firing line; and Rogers’ (181) evalu- 
ation of methods of training in estimating a fixed opening range. Hoffman 
and Mead (82) discussed the performance of Anti-Aircraft Artillery per- 
sonnel on a complex task of four-hours duration. Research in the training 
of engineers was discussed by Rogers, Viteles, and Voss (179, 180); and 
by Viteles, Gorsuch, and Watters (240) who discussed the improvement 
of a training program for newly organized crews for destroyers and for 
auxiliary ships. Masoner and Watters (149) presented an instructor's 
manual which served as a guide in the administration of special engineer- 
ing courses. A manual for training balance crew engineers for attack 
transport vessels was described by Covner, Gorsuch, and Viteles (41). 
Viteles and Gorsuch (236) prepared a memorandum on effective teaching 
methods for engineering instructors. A group of studies concerned with 
progressive engineering were reported by Viteles and Gorsuch (237, 238) 
who presented lesson plan outlines for Stages I and II of the instruction; 
by Organist et al (167), who prepared an instructor’s manual for Stage II 
and who presented outlines for instructions in Stage III (165); and by 
Organist and Willis (166) describing the organization and instruction 
for Stage III. Covner et al (44) prepared an instructor’s manual for pre- 
senting information on the distilling plant to engineering personnel. 

The training of fire controlmen and range finder operators was pre- 
sented by Beier et al (20) in the form of a series of lesson plans. The in- 
fluence of visual tasks in the training course of fire controlmen upon their 
visual proficiency was discussed by Adams, Beier, and Imus (2). Covner, 
Gorsuch, and Viteles (42) presented a manual for instructors with a 
detailed step-by-step procedure for operation of fireroom equipment on 
destroyer escort vessels. 

The training of radar operators was reported on by Lindsley et al (132), 
who discussed the use of the Philco trainer for A-scan oscilloscope oper- 
ators; and by Lindsley et al, in a series of articles (120, 128, 133), 
describing and presenting recommendations and generalizations for the 
use of the PPI flash-reading and tracking trainers in training Navy search- 
radar operators. Lindsley (119) also gave an account of the results of a 
study determining the effectiveness of the course of training SCR-270-71 
radar operators. The effectiveness of the Foxboro Trainer in training 
oscilloscope operators to track by means of pip-matching was evaluated by 
Lindsley and others (129), who also made recommendations for its use 
(127). A study of the SCR-584 basic trainer as a device for teaching range 
tracking was presented by Lindsley et al (130) and by Anderson et al (9). 
The Tufts Tracking Trainer was described by Hudson and Searle (85). 
The results of developmental work done on the design and construction 
of a director tracking trainer and experiments to determine the effects of 
various fatiguing circumstances on performance were summarized by 
Mead (151). Kappauf (92) reported on phototube scoring devices for 
tracking trainers. An experimental investigation involving a comparison 
between tracking to a fixed hairline and tracking to a rotating hairline was 
presented by Lindsley et al (124). Experiments in training radar operators 
in visual code reception were discussed by Anderson et al (8) and by 
Lindsley et al (125). The use of radar scope movies for briefing and 
reconnaissance purposes was evaluated by Lindsley (122). A study of 
performance was reported by Lindsley and others (131) concerning the 
reactions of radar operators under speed stress. Lindsley also described the 
factors determining the accuracy of reading oscilloscope code in a study 
(126) designed to find the speed, width, amplitude, dot-to-dash ratio, and 
letter-code cycles at which code can most accurately be read. 

An extensive group of studies concerning the training of telephone talkers 
was discussed by Black and Mason (24); by Snidecor, Mallory, and 
Hearsey (210) in relation to the use of mass drill, continuous prompting, 
instruction by skilled men, dramatic recording, criticism, and discussion 
as methods of training telephone talkers for increased intelligibility; by 
Hibbitt and Mallory (81) in relation to an experimental investigation of 
a course for telephone talkers; by Curtis (46) in relation to increasing the 
intelligibility of voice communication by training in voice technic and 
(47) in relation to the use of noise in a training program; by Abrams 
and others discussing the factors determining the intelligibility of speech 
in noise; and by Mason (146) concerning the effects of training on articu- 
lation. Mason and others (147) reported on the indoctrination of air- 
crewmen in voice communication at altitude and (148) on the training 
studies in voice communication. Studies of the effect of pitch on the 
intelligibility of voice communication were discussed by Mason (145). 
The relationship between loudness and intelligibility of airplane inter- 
phone communication was pointed out by both Curtis (48) and Talley 
et al (218). Reports on the analyses of mistakes made in word intelligi- 
bility tests over the T-17 microphone (144) and on the phonetic char- 
acteristics of words as related to their intelligibility in aircraft type noise 
(143) were made by Mason. Intelligibility in relation to various methods 
of holding the T-17 microphone for communication in noise was dis- 
cussed in a report of the Psychological Corporation (177). Talley, Curtis, 
and Haagen (217) reported on a related study on microphone position in 
voice communication. Snidecor (209) dealt with a preliminary study of the 
ability of rated men to judge speaking performance. Anonymous articles 
gave accounts of a study in training Classification Petty Officers to select 
telephone talkers (14) and of a speech interview for the selection of tele- 
phone talkers (13). The final report in summary of the work on the selection 
and training of telephone talkers was made by Mallory and Temple (141). 
An account of the technics and procedures used by the Voice Communi- 
cation Laboratory was presented by Haagen (76). The final summary of 
work on voice communication was given by Black (23). 

Reports on training studies in radio code work include a summary of 
research in Radio Code Project N-107 by Kurtz and Seashore (106) and 
in Project SC-88 by Keller (95); a comparison of training methods at 
two levels of code learning by Keller, Estes, and Murphy (98); reports 
by Keller and Estes (96, 97) on the effectiveness of different types of 
practice in code learning and by Keller (94) on the code voice method of 
teaching; a comparative study of three methods of teaching code in the 
early weeks of the course by Seashore and others (195); and a discussion 
of the standardization of code speed by Kurtz, Seashore, Stuntz, and 
Willits (108). The development of a graduation and rating test for 
Class A radio schools was discussed by the staff of the Psychological Cor- 
poration (176). A group of four studies concerning methods to be used in 
code classes included Seashore and others’ (197) discussion of variation 
of activities to prevent monotony in code classes; their (196) report on 
the effect of introducing sending code early in the course upon learning 
to receive; Seashore and Stuntz’ (194) manual of activities for reducing 
monotony in code schools; and their (193) experimental study of the train- 
ing of radio operators to copy code thru interference. Seashore and Kurtz 
(192) presented an analysis of the errors made in copying code. 

A miscellaneous group of training studies included a report by Ruch 
and others (186) outlining training procedures and lectures on winch 
operations and presenting a rating form for grading trainees on electric 
winch operation; a critical evaluation by Shuttleworth (201) of the Army 
Specialized Training Program with reference to selection standards and 
the method of “block training”; a discussion by Layman and Boguslavsky 
(110) of the relationship between ability and achievement in the Army 
Specialized Training Program which pointed out that “neither secondary 
schools nor colleges were sufficiently challenging to induce maximum 
relationship between ability and academic achievement in many individual 
instances”; a presentation by Carstater (215) of the Bureau of Naval 
Personnel program for officer training and one by Batchelder (215) for 
enlisted personnel. Feder (53) reported standardization of instruction in 
several Navy schools concerned with elementary electronics training thru 
the construction of an achievement test and the standardizing of procedures 
on the basis of test results. 


Training Devices 


Discussions of studies concerning training devices included Exton’s (52), 
Noel’s (163), and Stott’s (213) accounts of the use of audio-visual aids 
in expediting the Navy training program. Wattles (258) presented the 
results of the teaching of gunnery with aids such as flash cards, films, 
rating sheets, lesson plans, and observation record forms for evaluating 
the instructor. Ullman (227) described the procedures of several night- 
vision training devices. Lanier (109) explained a night lookout trainer 
for use aboard ship. Dresser (49) examined the use of slide films in the 
Navy training program. Witty and Goldberg (265) discussed the use of 
flash cards, training films, film strips, picture portfolios, bulletin boards, 
posters, cartoons, maps, diagrams, charts, and other visual aids in special 
training units in the Army. An anonymous article (90) described the 
shortcuts in learning skills, ways to speed training, study books, manuals 
and lessons, and other aids to military training. Thomas (225) reported 
on the use of animated cartoons in training and indoctrination in the Army. 
A discussion of ship models in classroom instruction and other training 
aids was presented by Viteles and Gorsuch (239). 

Viteles and others (250) presented a discussion of the psychological 
principles involved in the design and operation of synthetic trainers with 
particular reference to anti-aircraft gunnery. Viteles and others (247) 
also described an investigation of the Range Estimation Trainer Device 
5C-4 as a method of teaching range estimation. The use, characteristics, 
advantages, and disadvantages of all the synthetic trainers used by the 
Applied Psychology Panel projects were compared with training on real 
equipment by Wolfle (268). 


Morale Research 


Discussions concerned with the factors affecting morale included a 
report by Madigan (140) which emphasized the difficulties in army adjust- 
ment and the ways in which morale problems might be countered; one by 
Prattis (173) concerning the morale of the Negro in the Armed Services 
under the treatment received; and one by Evans (51) and one by O'Gara 
(164) pointing out some factors affecting military morale. Blain (25) 
discussed the war neuroses of merchant seamen and the personal and 
morale factors involved in their etiology and prevention. Homans (83) 
reported on the problems in morale and leadership on small warships. 
Woods (269) discussed the morale factors of naval noncombatants; 
Baganz, Mearin, and Woods (15) presented an account of the mental 
mechanisms and morale factors of Naval recruits in training. An anony- 
mous writer (89) summarized the points mentioned by soldiers as the 
features of army life most closely related to morale. A consideration of the 
development of rumors in the service and the ways of checking them was 
presented by Kelly and Rossman (101). 

Discussions of the factors which build morale included a presentation 
by Bassan (17) of factors found valuable in maintaining morale on a 
small combat ship; an account by Smith (204) of the personnel policy 
of the Navy and its relation to morale; a report concerning the problems 
of procurement, training, and morale among members of the Women’s 
Reserve of the U. S. Coast Guard by Stratton and Springer (214); a 
description by Rose (182) of the bases and weaknesses of American mili- 
tary morale in World War II; considerations by Schroeder (191) and 
by Kreinheder (103) of the orientation program in the Army and the 
qualities of good orientation officers; an anonymous article (153) con- 
cerning planned orientation for combat, orientation objectives, and the 
execution of the orientation course; a presentation by Brosin (28) of 
a program for utilizing the marginally unfit in the Armed forces and 
an analysis of the basic principles involved in morale improvement; a 
description of an analysis made of the morale of American occupation 
troops before and after the end of World War II and means of improving 
military morale by Warner (256); and a report by Rottersman (183), 
based on the analysis of 20,000 selectee questionnaires regarding com- 
plaints, on morale as a factor in complaint reduction. 

Civilian research in morale included Allport and Schmeidler’s history 
(6) of a clearing house to aid psychologists in problems of morale. Shils 
(198) discussed the effect of governmental investigation on attitudes and 
morale. Appel and Hilger (12) presented a morale and preventive-psy- 
chiatry program in the Army. Osborn (168) summarized the services of 
the Morale Branch of the War Department as they affected the recreation, 
welfare, and morale of the American soldier. 


Leadership Studies 


A suggestion was made by Miller (158) that leadership could be taught 
by actual training under officers who are themselves good leaders and 
by experiencing leadership and its problems. A discussion by Metsker (152) 
included material on the mental characteristics of military leadership from 
the standpoints of selecting and training leaders. Mayberry (150) de- 
scribed an interview rating scale and technics employed in evaluating 
leadership qualities of officer candidates. McNassor (138) and Bavelas (18) 
discussed the training of leaders, and MacKechnie (136) reported on the 
development of leadership in small unit commanders. An outline syllabus, 
used as an aid in the Academy’s first course on the psychology of military 
leadership, was presented by the U. S. Military Academy at West Point 
(228). Intangible factors in combat, including teamwork and leadership,
were considered by McLain (137). Garrett and Ligon (66) in a report
on combat leadership concluded that unless leadership is defined in some 
way which permits direct measurement of specific qualities, research on 
predictive items is likely to be useless. Ligon (116) discussed the problems 
of choosing efficient officer candidates, reports from combat areas, and
interviews with ex-combat officers concerning the characteristics of good 
combat leadership. A study of traits most frequently mentioned for a 
good officer and for distinguishing a good officer from a good enlisted man 
was reported by Heath and Gregory (80). Ageton (5) presented a dis- 
cussion and bibliography on military leadership and training methods. 
The development of a manual for instructors of leadership courses in Offi- 
cers’ Training School was presented by the Staff of the Bureau of Naval 
Personnel (29). The OSS Staff (229) gave an account of the measurement 
of leadership in life situation tests such as the Mined Road, Getting Past 
the Sentry, The Blown Bridge, and Killing the Mayor, where candidates 
were assigned leadership and expected to lead a group of men. 


Proficiency and Achievement 


Problems in the measurement of achievement in Naval Training Pro- 
grams, the types of tests developed and the outcomes of the Achievement 
Examination Program were described by the Staff of the Bureau of Naval 
Personnel (220). A group of reports on achievement included Ryan’s 
(215) and Feder’s (54) discussions of the services provided to Navy 
Training thru achievement examinations; Porter and Harsh’s (215) 
presentation of achievement examinations for elementary enlisted schools; 
Feder and Lawrence’s (215) account of the measurement of achievement 
in the Radio Technician Training Program; and Cruikshank and Darling’s 
(215) description of the Advancement in Rating Examination developed 
by the Bureau of Naval Personnel. Anderson and others (11) discussed
vision as related to proficiency in oscilloscope operation. Lindsley (123) 
wrote on the same topic and gave recommendations concerning minimum 
visual standards for radar operators. Ruch (184) evaluated a subjective and 
an objective technic for rating winch operating ability. Prentice (174)
reported a study of the performance of night lookouts aboard ship. Keller
and Jerome (99) outlined a system for describing progress in receiving 
International Morse Code. The construction and validation of a work readi- 
ness test for distilling plant operators which served as an objective technic 
for evaluating proficiency was described by Covner, Voss, and Wesley (43)
and by Voss and Wesley (254). 


Criterion Measures 


Discussions of research on criterion measures included Bechtoldt’s (215) 
and Patterson’s (169) articles on the problems of the criterion in predic- 
tion; Sisson’s (203) description of the criterion in Army personnel re- 
search and the results of an exploration of the “nomination” technic as a 
possible criterion of soldiers’ performance, which showed a correlation of 
the order of .50 between scores on a selected test battery for enlisted men 
and high and low “nominations” by fellows for competence; and Vaughn's 
(230) discussion of this same technic which gave evidence of value as a 
criterion in exploratory studies with Navy pilots. 
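
The kind of relationship reported for the "nomination" technic can be
illustrated with a short computation. The sketch below assumes that the
high and low nominations are coded 1 and 0 and that an ordinary
product-moment (point-biserial) correlation is taken with the battery
scores; the text does not say which coefficient was actually used, and
the scores shown are invented for illustration only.

```python
import math


def point_biserial(scores, groups):
    """Pearson correlation between numeric scores and a 0/1 group code."""
    n = len(scores)
    if n != len(groups) or n < 2:
        raise ValueError("need two equal-length lists with at least two cases")
    mean_x = sum(scores) / n
    mean_g = sum(groups) / n
    cov = sum((x - mean_x) * (g - mean_g) for x, g in zip(scores, groups))
    var_x = sum((x - mean_x) ** 2 for x in scores)
    var_g = sum((g - mean_g) ** 2 for g in groups)
    if var_x == 0 or var_g == 0:
        raise ValueError("scores and groups must both vary")
    return cov / math.sqrt(var_x * var_g)


# Hypothetical battery scores and nominations (1 = high, 0 = low).
battery_scores = [52, 61, 47, 70, 58, 66, 49, 63]
nominated_high = [0, 1, 0, 1, 1, 0, 0, 1]

r = point_biserial(battery_scores, nominated_high)
print(f"correlation between scores and nominations: {r:.2f}")
```

A positive coefficient indicates that men nominated as highly competent
tended to score higher on the battery.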

Methods of obtaining criteria of shipboard competence appeared in a 
discussion by Bechtoldt, Maucker, and Stuit (19). Franzen presented (57) 
a method for selecting the best combination of dichotomous arrangements 
to distinguish a categorical criterion. The effect on prediction of success 
of an increasingly well-defined criterion was described in an article by 
Stuit and Wilson (216). Miller (155) presented a discussion of the selec- 
tion and reliability of a criterion of proficiency in operating the LCVP. 


CHAPTER VI


Wartime Research in Psycho-Acoustics
MARK R. ROSENZWEIG and GERALDINE STONE 


THE PRESSING demand for effective voice communication during the war
stimulated widespread research in psycho-acoustics—the application of 
psychological methods to problems of acoustics, speech, and hearing. Exist- 
ing communication equipment and technics had to be tested; new equip- 
ment and technics had to be designed. Typical wartime noises had to be 
measured and studied, their effects evaluated, and some of them combatted. 
Human factors in communication had to be determined, utilized, and al- 
lowed for. Some of the studies directed toward these problems are mentioned 
in this chapter. 

Summaries of the extensive research on psycho-acoustics of the Applied 
Psychology Panel are contained in the book, Human Factors in Military 
Efficiency: Training and Equipment, by Wolfle and others (135) and in
two reports by Black (18), and by Mallory and Temple (87). The work 
of the Psycho-Acoustic and Electro-Acoustic Laboratories, Harvard Uni- 
versity has been summarized by Miller, Wiener, and Stevens (97). This 
book includes references to relevant work performed at other laboratories 
and considerable background information. 


Voice Communication 


Basic to the investigation of speech material, communication personnel, 
and communication equipment was the method of “articulation tests” (30, 
31, 59, 60, 61, 111, 126). This was a method of testing communication 
systems by determining how well they serve to transmit speech. Carefully 
chosen speech items were employed; the proportion of items correctly re- 
ceived provided an indication of the relative effectiveness of the system. 
For any devices under consideration, as, for example, microphones, ar- 
ticulation tests could be used to indicate the relative effectiveness of differ- 
ent possibilities: carbon microphones, dynamic microphones, and magnetic 
microphones. Alternatives to the formal articulation test were abbreviated 
testing methods (31), subjective appraisal of intelligibility (6, 31), and 
threshold methods for evaluating intelligibility (31). 


Speech Material 


The type of speech material used was found to be an important factor 
in intelligibility. Analyses were made of the phonetic characteristics of 
words as related to their intelligibility (1, 3, 6, 18, 89, 90). Recordings 
of messages made in combat situations were analyzed to provide information
about common errors and failures of communication (7, 18). Intelligibility
involves not only the physical characteristics of speech material
(acoustic spectra) but also such characteristics as the average number of
sounds per word, the relation of a word to other words in the language,
and apperceptive variables (97). On the basis of such considerations,
highly intelligible vocabularies, phonetic alphabets, standard forms of
command, and lists of call signals and telephone directory names were
constructed and tested. Various procedures for the pronunciation of numerals
were also tested (1, 2, 3, 5).


Distortion 


Distortion and interference are also factors in intelligibility. Experiments 
on amplitude distortion were performed with nonlinear circuits which
clipped either the peaks or the center of the signal, or rectified the signal.
The effects of each type of distortion on intelligibility were determined by 
articulation tests, and various measures of distortion were compared for 
their relation to the impairment of intelligibility (78). The effects of adding 
noise, both before and after distortion, were studied. Peak clipping (58, 
73, 75, 76, 78, 79, 81, 85) was found to improve the intelligibility of a 
signal if measurement was in terms of peak voltages. Such peak clipping 
may be used to advantage in hearing aids to protect the ear at high in- 
tensity levels, in AM radio transmission to allow continuous 100 percent 
modulation, and in radio telephony to improve intelligibility when static is 
present. Center clipping and rectification, on the other hand, were found 
to be detrimental to intelligibility. 
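
The three kinds of amplitude distortion named above can be sketched on a
sampled waveform. This is an illustration under stated assumptions, not
a description of the circuits used: the function names are hypothetical,
center clipping is taken to remove the portion of the wave within the
limit and shift the remainder toward zero, and rectification is taken to
be half-wave, since the text does not specify either detail.

```python
import math


def peak_clip(samples, limit):
    """Limit every sample to the range -limit..+limit."""
    return [max(-limit, min(limit, s)) for s in samples]


def center_clip(samples, limit):
    """Zero samples within +/-limit and shift larger ones toward zero by limit."""
    clipped = []
    for s in samples:
        if s > limit:
            clipped.append(s - limit)
        elif s < -limit:
            clipped.append(s + limit)
        else:
            clipped.append(0.0)
    return clipped


def half_wave_rectify(samples):
    """Keep positive samples and replace the rest with zero (assumed half-wave)."""
    return [s if s > 0 else 0.0 for s in samples]


# One cycle of a unit-amplitude sine wave stands in for a speech waveform.
wave = [math.sin(2 * math.pi * n / 16) for n in range(16)]

peak_clipped = peak_clip(wave, 0.5)
center_clipped = center_clip(wave, 0.5)
rectified = half_wave_rectify(wave)
```

Peak clipping keeps the low-amplitude part of the wave and discards the
peaks, whereas center clipping does the reverse, which matches the
contrast drawn above between the two.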

The effects of frequency distortion on intelligibility were investigated 
with the use of low-frequency cut-off (43, 125), gradual “tilted” cut-offs 
(58), and band-pass filters (39, 40). Various levels and spectra of masking
noise were used. Results showed that an adequate speech-to-noise ratio 
should be provided over as wide a frequency range as possible. For ideal 
speech transmission, the frequency range should extend from about 200 to 
7000 cps, and the signal-to-noise ratio at each frequency should be 25 db
or more. Combined frequency and amplitude distortion was studied with 
speech material that was both “tilted” (i.e., put thru a system with an 
oblique response characteristic having a regular gain per octave) and peak 
clipped (58). 
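
The transmission requirement stated above (a ratio of 25 db or more from
about 200 to 7000 cps) can be checked band by band. The sketch below
assumes that signal and noise are given as powers, so that the ratio in
decibels is ten times the common logarithm of their quotient; the band
centers and power values are hypothetical.

```python
import math

MIN_RATIO_DB = 25.0                 # figure quoted in the text
RANGE_CPS = (200.0, 7000.0)         # frequency range quoted in the text


def ratio_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels for a pair of powers."""
    return 10.0 * math.log10(signal_power / noise_power)


def bands_below_target(band_powers):
    """Return band centers inside the range whose ratio falls below 25 db.

    band_powers maps a band center in cps to (signal_power, noise_power).
    """
    low, high = RANGE_CPS
    return sorted(
        center
        for center, (signal, noise) in band_powers.items()
        if low <= center <= high and ratio_db(signal, noise) < MIN_RATIO_DB
    )


# Hypothetical measurements for four bands, in arbitrary power units.
measurements = {
    250.0: (1000.0, 1.0),   # 30 db
    1000.0: (500.0, 1.0),   # about 27 db
    4000.0: (100.0, 1.0),   # 20 db, below the target
    8000.0: (10.0, 1.0),    # outside the stated range, not checked
}

deficient = bands_below_target(measurements)
print("bands needing improvement (cps):", deficient)
```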

The quality of speech was found to change at high altitudes (71, 80, 
113). Attempts to improve the intelligibility of speech at high altitude thru 
deliberate frequency distortion gave little success (96). Modifications of
equipment resulted in improved performance (83). 


Interference 


Interference was a major problem of military communication, and an 
extensive range of signals and noises was investigated to determine what 
factors contribute to their effectiveness for masking (39, 48, 62, 93, 95, 
112, 123, 125, 128). The important dimensions of interference are its
intensity, frequency, spectrum, temporal continuity, and annoyance value.
Noise was found to mask best when it is uninterrupted and when it has
a broad spectrum with a signal-to-noise ratio that is constant at all
frequencies. Greater annoyance is caused by interrupted and high-frequency
interference. Pure tones do not mask speech effectively, but continuous 
tones of low fundamental and rich in harmonics mask almost as well as 
noise, and they are more annoying. Speech can be used to mask other 
speech, but its spectrum and not its meaning is the chief factor in masking. 

The effect of interference was investigated also in the case of radio range 
signals (50), and radar signals (53). 


Selecting and Training Personnel 


Since an effective communication system requires good talkers and listen- 
ers, research was conducted in selection and training of personnel. Studies 
were made on the rating of talkers (6, 15, 100, 109, 110, 120, 121). Voice 
factors found to be closely related to intelligibility were loudness (21, 22),
intensity control (100), and precise articulation (92). Factors showing
a slight relation to intelligibility were pitch (91), voice spectrum (6), rate 
of speaking (18), telephone experience, education, listening ability, and 
memory span (100). General American dialect was found to be slightly 
more intelligible than Southern or Eastern (100). 

It was easier to test listeners than talkers, because standardized phono- 
graphic tests could be given to many listeners at once (24, 69, 107). Articu- 
lation tests given under relatively quiet conditions were found not to show 
who will listen well in noise (110). Noise generators were therefore
designed for testing and training programs (11, 12, 48, 101, 127).
Experiments suggested the existence of an ability to listen in noise which is
independent of distortion due to particular equipment, of the spectrum of the
interfering noise, and of the type and mode of presentation of the speech
material (69). Slight relation was found between listening ability and the
following measures (69): code ability, intelligence as measured by the 
GCT, auditory memory span (4), and speaking ability. Listening ability 
was found to be somewhat related to region of residence (18). Experiments 
were also made to determine the nature and extent of individual differences 
in the detection of small changes in noise; tests were constructed for dis- 
crimination of pitch and loudness of noises (68). These have not yet been 
used in studying listening ability. 

Rapid and extensive improvement of performance was obtained with
training of both talkers and listeners (18, 110). Several training programs
were devised and tested (5, 16, 23, 24, 64, 88, 92, 99, 122); manuals and
syllabi were prepared (8, 9, 10, 14, 99); special training equipment was
designed and used (11, 98). The improvement found was attributed to
several factors: training in voice technic (23), training in the use of
communication equipment (6, 13, 32, 47, 104, 110, 129), and training in the
identification of words that are partially masked by noise or distorted by
characteristics of the equipment used (24, 110).


Communication Equipment 


Along with speech material and communication personnel, communica- 
tion equipment was studied in the over-all program of bettering military 
communication. The testing methods used by the Harvard Psycho-Acoustic 
and Electro-Acoustic Laboratories and the results they obtained are re- 
viewed in a technical summary report (97). 

Microphones were tested with the human voice and with artificial “voices” 
(17). The properties of earphones were tested by utilizing the responses of 
listeners; by measuring sound pressures at the ear canal with a probe tube 
(102) ; and by the use of artificial “ears” (17, 132). The pressure distribu- 
tion in the auditory canal was also obtained in a progressive sound field by 
use of the probe-tube technic (133). Earphone cushions, earphone sockets 
and headsets (36, 37, 41, 57, 116), and masks (74) were tested for their 
effects on communication. Measurement was made of physiological noise 
generated under earphone cushions (94). The characteristics of micro- 
phones (56) and noise shields (35, 131), amplifiers (80), radio link (21), 
and receivers (20, 33, 86) were investigated. Studies were made of radio 
equipment, interphone equipment (43, 46, 71, 80, 83), sound-powered 
equipment (38, 54, 134), and radio-range equipment (55). 


Effects of Noise on Psychomotor Efficiency 


One of the first military projects of the Psycho-Acoustic Laboratory was 
to study the effects on psychomotor efficiency of intense noise and vibration 
(123, 128). A battery of psychological, psychomotor, and physiological 
tests was developed and used to evaluate the effects of noise on a wide 
variety of tasks. In some of the experiments subjects were exposed to 115 db 
of airplane noise for seven-hour work days over a one-month period. The 
subjects reported the noise to be disagreeable and tiring, but their perform- 
ance was largely unimpaired by it. They had temporary hearing losses 
following exposure to noise—losses whose extent and duration depended 
upon the over-all intensity of the noise, its spectrum, and length of
exposure. Other reports give fuller information on intense stimulation as the
cause of temporary deafness, injury of the inner ear, and other physiological 
effects (25, 26). In one of the psychomotor tests the subjects’ coordinated 
serial reaction time showed an increase of 5 percent in the noise. This 
was actually the greatest effect of acoustic stress shown by any of the 
psychomotor tests, and the validity of this result is open to question. No 
other test showed significant decrements of performance due to noise. 
Indeterminate effects of noise were found in the following tests: muscular 
tension, metabolism, breathing, speed of accommodation, saccadic
movements, body sway, hand steadiness, reversible perspective, and dark
adaptation. No effects of noise were found in these tests: coordinated serial
pursuit, serial disjunctive reaction time, fast-speed pursuit rotor, card sorting,
coding test, and judgment of distance. Vibration caused a considerable 
reduction in visual acuity in every subject. An extensive investigation was 
made of sound as a military weapon (19). Direct use of sound as a weapon 
was found to be impractical because it required too great an expenditure 
of energy. 


Noise Measurement 


Investigation of the effects of noise on communication and psychomotor 
efficiency required measurement of noise intensities and spectra. The prob- 
lems and methods which this entailed have been reviewed by Miller, Wiener, 
and Stevens (97). 


Combatting Noise 


Noise reduction, sound insulation, and aural protective devices were 
used to lessen impairment of communications and to reduce annoyance 
(97). Airplane noise was attenuated by sound-proofing materials (29). 
Sound insulation was accomplished by proper design of earphone cushions, 
sockets, and headsets (34, 42, 45, 49, 57, 114, 116, 118) and by
development of special insert tips for use with miniature earphones (117).
Protection against noise and gun blasts was afforded by special earplugs
(70, 77, 115). Reception of speech slightly above normal levels was not
impaired by the use of these earplugs; in noise, audibility of speech was,
in some instances, improved by their use (72).


Hearing Loss and Hearing Aids 


Several tests were developed for the direct measurement of hearing loss 
for speech (67, 105, 106, 108). These tests employ recordings of selected 
words and sentences, the loss being measured from standards set by normal 
subjects. The use of such test material transmitted thru filters may allow 
differential diagnosis of uniform losses and high-frequency hearing losses 
for speech (65). The tolerance of normal and hard-of-hearing subjects for 
intense sounds, both pure tones and speech, was determined. Thresholds 
of discomfort, of tickle, and of pain were measured (119). 

Commercial hearing aids were evaluated on the basis of electro-acoustic 
and psycho-acoustic measurements (97, 102, 103). Studies of design ob- 
jectives for hearing aids (27, 28, 66) indicated the desirability of these 
properties: uniform frequency response between 300 and 4000 cps,
limitation of maximum acoustic output, an effective gain control with a range
of at least 40 db, and no acoustic feedback or electrical feedback (“squeal”).


Auditory Signals for Instrument Flying


The possibility of putting some airplane instrument indications in audi- 
tory form was investigated (51, 52). Auditory signals were devised which 
“sounded like the behavior of the airplane,” and which did not interfere
with reception of radio-range signals or voice communication. An
“automatic annunciator” was developed to translate instrument indications
automatically into spoken messages and to announce them to the pilot. The
annunciator had a readily identifiable speech quality, and there was little 
difficulty in distinguishing between it and outside speech sources. 